Comprehensive evaluation of the implementation of episignatures for diagnosis of neurodevelopmental disorders (NDDs)

Giuili, Edoardo; Grolaux, Robin; Macedo, Catarina Z. N. M.; Desmyter, Laurence; Pichon, Bruno; Neuens, Sebastian; Vilain, Catheline; Olsen, Catharina; Van Dooren, Sonia; Smits, Guillaume; Defrance, Matthieu

doi:10.1007/s00439-023-02609-2

Original Investigation
Open access
Published: 27 October 2023

Volume 142, pages 1721–1735, (2023)

Human Genetics


Edoardo Giuili¹^na1,
Robin Grolaux¹^na1,
Catarina Z. N. M. Macedo¹,
Laurence Desmyter²,
Bruno Pichon²,
Sebastian Neuens^2,3,
Catheline Vilain^2,3,
Catharina Olsen^1,4,5,
Sonia Van Dooren^1,4,5,
Guillaume Smits^1,2,3 &
…
Matthieu Defrance¹


Abstract

Episignatures are popular tools for the diagnosis of rare neurodevelopmental disorders. They are commonly based on a set of differentially methylated CpGs used in combination with a support vector machine model. DNA methylation (DNAm) data often include missing values due to changes in data generation technology and batch effects. While many normalization methods exist for DNAm data, their impact on episignature performance has never been assessed. In addition, technologies to quantify DNAm evolve quickly and this may lead to poor transposition of existing episignatures generated on deprecated array versions to new ones. Indeed, probe removal between array versions, technologies or during preprocessing leads to missing values. Thus, the effect of missing data on episignature performance must also be carefully evaluated and addressed through imputation or an innovative approach to episignature design. In this paper, we used data from patients suffering from Kabuki and Sotos syndromes to evaluate the influence of normalization methods, classification models and missing data on the prediction performance of two existing episignatures. We compare how six popular normalization methods for methylation array data affect episignature classification performance in Kabuki and Sotos syndromes and provide best practice suggestions when building new episignatures. In this setting, we show that Illumina, Noob or Funnorm normalization methods achieved higher classification performance on the testing sets compared to Quantile, Raw and Swan normalization methods. We further show that penalized logistic regression and support vector machines perform best in the classification of Kabuki and Sotos syndrome patients. Then, we describe a new paradigm to build episignatures based on the detection of differentially methylated regions (DMRs) and evaluate their performance compared to classical differentially methylated cytosines (DMCs)-based episignatures in the presence of missing data.
We show that the performance of classical DMC-based episignatures suffers from the presence of missing data more than the DMR-based approach. We present a comprehensive evaluation of how the normalization of DNA methylation data affects episignature performance, using three popular classification models. We further evaluate how missing data affect those models’ predictions. Finally, we propose a novel methodology to develop episignatures based on differentially methylated regions identification and show how this method slightly outperforms classical episignatures in the presence of missing data.


Background

DNA methylation (DNAm) of the 5th carbon of cytosine sitting in cytosine-guanine (CpG) context is the most studied epigenetic process. Identification of differentially methylated cytosines (DMCs) is a promising area to improve diagnosis and gain knowledge on the etiology of diseases (Reichard and Zimmer-Bensch 2021; Mc Auley 2021). Unique methylation patterns allowing to discriminate between cases and control are termed episignatures (Malouf et al. 2016; Grafodatskaya et al. 2013). They are often deemed to be secondary epivariations (i.e., caused by sequence variants) as opposed to primary epivariations (i.e., caused by stochastic errors of the methylation machinery) (Horsthemke 2006). In the context of rare neurodevelopmental disorders (NDDs), episignatures in combination with classification models have emerged as a powerful tool to enable patients’ diagnosis and evaluate the pathogenicity of variants of unknown significance (VUS) (Choufani et al. 2015; Levy et al. 2022). To this date, more than 60 episignatures associated with neurodevelopmental disorders have been published. The majority of those have been developed from methylation data generated with the Illumina Infinium Methylation BeadChips 450 K and EPIC. In essence, episignature discovery derives from epigenome-wide association study (EWAS). Identification of DMCs can be done through group comparison by means of a statistical method. Classically, linear regression or Mann–Whitney U test and the addition of an effect threshold on the difference between controls and cases beta-values are applied to highlight DMCs. The resulting set of DMCs beta-values is then used to train a classification model such as support vector machines (SVM), random forest (RF) or penalized logistic regression (PLR) (Grafodatskaya et al. 2013; Choufani et al. 2015; Butcher et al. 2017). 
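The classical DMC selection described above (a Mann–Whitney U test per CpG combined with a threshold on the difference in beta-values) can be sketched as follows. This is a minimal illustration, not the procedure of the cited studies: the function names, the significance level `alpha`, the effect threshold `min_delta`, the probe identifiers and the beta-values are all assumptions, the p-value uses the normal approximation, and no multiple-testing correction is applied.

```python
import math
from statistics import mean

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average rank shared by tied values
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    sigma2 = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if sigma2 == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(sigma2)
    return u1, math.erfc(abs(z) / math.sqrt(2))

def select_dmcs(cases, controls, alpha=0.05, min_delta=0.10):
    """Keep probes that are significant AND exceed the effect threshold.

    cases / controls: dict mapping probe id -> list of beta-values.
    alpha and min_delta are illustrative values, not those of the cited studies.
    """
    selected = []
    for probe in cases:
        _, p = mann_whitney_u(cases[probe], controls[probe])
        delta = mean(cases[probe]) - mean(controls[probe])
        if p < alpha and abs(delta) >= min_delta:
            selected.append(probe)
    return selected

# Hypothetical beta-values for three probes.
cases = {
    "cg_a": [0.80, 0.82, 0.85, 0.79, 0.83, 0.81],   # large, consistent shift
    "cg_b": [0.50, 0.52, 0.49, 0.51, 0.50, 0.53],   # no shift
    "cg_c": [0.55, 0.56, 0.57, 0.55, 0.56, 0.57],   # consistent but small shift
}
controls = {
    "cg_a": [0.40, 0.42, 0.38, 0.45, 0.41, 0.39],
    "cg_b": [0.51, 0.50, 0.52, 0.49, 0.50, 0.52],
    "cg_c": [0.50, 0.51, 0.52, 0.50, 0.51, 0.52],
}
selected = select_dmcs(cases, controls)
```

In this toy example only `cg_a` passes both filters; `cg_c` is statistically separable but falls below the effect threshold, which is the situation the additional threshold is meant to exclude.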
Despite their increasing popularity, few guidelines have been issued on how to generate episignatures, with the exception of an article by Chater-Diehl and colleagues that discusses the influence of the size of the training data, its origin, and effect size (Chater-Diehl et al. 2021). The absence of a gold standard methodology has led to the generation of different episignatures for similar pathologies (Choufani et al. 2015; Butcher et al. 2017; Aref-Eshghi et al. 2020), and raised the question of reproducibility and adaptability to new datasets, as several key parameters affect the generation of DNAm data through Infinium BeadChips. Indeed, due to array design, it was shown that normalization plays a crucial role in data preprocessing (Dedeurwaerder et al. 2011; Bizet et al. 2022). One of the main tools to preprocess Infinium BeadChips data is the minfi package in R (Fortin et al. 2017; Aryee et al. 2014). This package includes different normalization methods for methylation data that are considered the gold standards: Beta MIxture Quantile (BMIQ) dilation normalization (Touleimat and Tost 2012), Functional normalization (Funnorm) (Fortin et al. 2014), Noob (normal-exponential out-of-band) (Triche et al. 2013), Subset-quantile Within Array Normalization (SWAN) (Maksimovic et al. 2012), as well as the reimplementation of the Illumina GenomeStudio normalization method, and a raw normalization (no normalization). Therefore, there is a need to evaluate how those different normalization methods affect episignature discovery and robustness. In addition, since the widespread adoption of the Infinium HumanMethylation 450 K BeadChip (IM450K), there have been two iterations of the Illumina microarray: Infinium HumanMethylation EPIC BeadChip (IMEPIC) and Infinium HumanMethylation EPICv2 BeadChip (IMEPICv2). Despite high inter-array robustness and reproducibility across tissues (Cheung et al. 2020; “Comparison of Illumina 450K and EPIC arrays in placental DNA methylation”; Moran et al. 2016), each new version of the arrays involves the addition and removal of CpGs. Therefore, CpGs from an episignature produced on one technology are not always present on another version. This issue also arises from the removal of batch-effect-related probes during preprocessing. Those missing values require imputation for the classifier to work, which can introduce noise and error. An alternative strategy to overcome those limitations would be to consider methylation levels in regions of consecutive DMCs, called differentially methylated regions (DMRs). Indeed, DMRs have been identified in several neurodevelopmental and congenital disorders (Barbosa et al. 2018; Kagami et al. 2021) and showed diagnostic utility even in a single patient setting (Grolaux et al. 2022). Thus, using the median methylation level of each DMR would mitigate the presence of missing values and is more likely to stay compatible with new microarray iterations. Finally, three main classification models (SVM, RF, and PLR) are usually adopted to evaluate the capacity of an episignature to differentiate between cases and controls (Hannon et al. 2021). However, their performance has never been thoroughly assessed. In this work, we perform the first comprehensive evaluation of how normalization methods and classification algorithms affect prediction performance, based on two episignatures for Kabuki and Sotos syndromes (Choufani et al. 2015; Butcher et al. 2017). We further show that missing data affect the performance of classical DMC-based episignatures and propose two new episignatures based on the detection of DMRs instead of DMCs to mitigate this effect. In conclusion, we provide researchers with guidelines on the preprocessing steps to consider when implementing new episignatures as well as a new paradigm to build DMR-episignatures (DMR-esigs) instead of the classically used DMC-based episignatures (DMC-esigs).

Results

Testing Kabuki and Sotos DMCs episignatures

The DMC-esig for Kabuki (Kabuki^DMC) and Sotos (Sotos^DMC) have been described before in two separate studies (Choufani et al. 2015; Butcher et al. 2017). They are composed of, respectively, 113 and 7,085 CpGs identified using Mann–Whitney U test and a threshold on effect size. Idat files from the original studies were normalized using 6 different methods available in minfi: Swan, Funnorm, Raw, Illumina, Quantile, and Noob. We showed that unsupervised hierarchical clustering was able to cluster cases and controls of the training sets for almost every normalization method using both signatures (Additional Figs. 1 and 2). Indeed, we noticed that both Sotos^DMC_Raw and Sotos^DMC_Swan failed to completely separate patients and controls, as two cases (GSM1920409, GSM1920415) were clustered with the controls. Moreover, those two cases are clustered together even when normalizing data with the other methods and noticeably display a different methylation profile than the rest of the case-cohort (Additional Fig. 2). Interestingly, those two samples are described in the original paper as sharing the same mutation at the end of exon 22 of NSD1, which could explain this behavior. Dimensionality reduction using UMAP allowed clear separation of the different samples for both signatures (Additional Figs. 1 and 2).

Building Kabuki and Sotos DMRs

To build DMR-esig for Sotos (Sotos^DMR) and Kabuki (Kabuki^DMR) syndromes, we applied the bumphunter method from champ.DMR to the training data from the two aforementioned studies, normalized with the 6 different methods in minfi. For each DMR found, according to each normalized dataset, several parameters are reported in Additional Files 1 and 2 for Sotos and Kabuki syndrome, respectively, including the length of the DMR (width) and the number of CpGs included in the region (L). The resulting number of DMRs varied between both syndromes and normalization methods (Additional Fig. 3). The mean number of DMRs identified was higher in Sotos (n = 51) than in Kabuki (n = 38). However, this was mainly due to the Illumina normalized dataset that showed the highest number of DMRs in both syndromes (Kabuki = 75, Sotos = 171, Additional Fig. 3). For each dataset, the median methylation level per DMR was computed using the beta-values from the CpGs located within the region. The resulting set of median DMR methylation values constitutes the DMR-esig. We applied unsupervised hierarchical clustering and showed that, similarly to DMC-esig, Sotos^DMR and Kabuki^DMR were able to discriminate between cases and controls using all the normalization methods (Fig. 1 and Additional Fig. 4). Notably, median methylation values of DMRs in Sotos cases were lower than in controls. UMAP dimensionality reduction allowed a clear separation between all samples in the training sets (Fig. 1, Additional Figs. 4 and 5). As described in the section above, the two cases (GSM1920409, GSM1920415) in the Sotos cohort were still clustered together in the heatmap using Sotos^DMR, demonstrating that the DMR-esig is still able to capture the different methylation profiles of these two samples (Fig. 1). Furthermore, Kabuki samples in the UMAP generated with DMR-esigs are less separated than in the UMAP generated with DMC-esigs. 
This can be due to the presence of different information captured by the DMRs than the DMCs and can result in a different arrangement of points in the reduced-dimensional space.
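The construction of a DMR-esig feature vector (the median beta-value of the CpGs located within each region) can be sketched as below. This is an illustrative sketch only: the function name, the DMR and probe identifiers and the beta-values are hypothetical, and the choice to simply skip missing CpGs when taking the median is an assumption about how such a feature would be computed when probes are absent.

```python
from statistics import median

def dmr_signature(betas, dmr_to_cpgs):
    """Median beta-value per DMR, computed over the CpGs that are observed.

    betas: dict probe id -> beta-value, or None when the probe is missing.
    dmr_to_cpgs: dict DMR id -> list of probe ids located in that region.
    A DMR with no observed CpG at all yields None.
    """
    signature = {}
    for dmr, cpgs in dmr_to_cpgs.items():
        observed = [betas[c] for c in cpgs if betas.get(c) is not None]
        signature[dmr] = median(observed) if observed else None
    return signature

# Hypothetical regions and probes.
dmr_to_cpgs = {"DMR_x": ["cg1", "cg2", "cg3"], "DMR_y": ["cg4", "cg5"]}

complete = {"cg1": 0.20, "cg2": 0.30, "cg3": 0.40, "cg4": 0.70, "cg5": 0.90}
full_sig = dmr_signature(complete, dmr_to_cpgs)

# Same sample on an array where cg2 is no longer present.
with_gap = dict(complete, cg2=None)
gap_sig = dmr_signature(with_gap, dmr_to_cpgs)
```

In this toy case the feature for `DMR_x` stays at roughly 0.30 with or without `cg2`, which illustrates why a regional summary depends less on any individual locus than a per-CpG feature does.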

Classification results and prediction probabilities

Results from the previous section led to the training of 18 classifiers per syndrome. We trained random forest (RF), support vector machine (SVM), and penalized logistic regression (PLR) models on all six normalized training sets, using the DMC-esig and the corresponding DMR-esig (see Methods). After performing LOOCV to find the optimal hyperparameters, the trained SVM, RF, and PLR classifiers were tested on both IM450K testing set from the original studies (Choufani et al. 2015; Butcher et al. 2017) and IMEPIC testing set from the ULB Erasme Hospital Center for Human Genetics. Performances of the classifiers were evaluated using MCC. In the case of Sotos syndrome, all models achieved a MCC of 1 when trained on the DMC-esig. However, only PLR achieved a MCC of 1 regardless of the normalization method used, showing higher robustness to normalization methods than SVM and RF. Indeed, SVM and RF achieved maximum performance only with Illumina, Noob, and Funnorm normalization methods (Fig. 2). Our models trained on DMR-esig for Sotos syndrome shared similar behavior. PLR was again the best model, showing high robustness to the normalization method and achieving an MCC of 1 for all normalization methods. SVM and RF were able to reach an MCC of 1 only with Noob and Funnorm normalization methods (Fig. 2). Concerning Kabuki syndrome, among the models trained on DMC-esig, PLR was able to reach an MCC of 1 in both testing sets for all the normalization methods, again showing overall robustness to normalization methods. SVM and RF reached an MCC of 1 when using Funnorm, Illumina, and Noob functions. Concerning our models trained on DMR-esig, when tested on both testing sets, SVM reached an MCC of 1 when using Funnorm, Illumina and Noob normalization methods. PLR was able to reach an MCC of 1 only with Noob and Illumina functions, while RF never reached an MCC of 1 (Fig. 2).
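For reference, the Matthews correlation coefficient (MCC) used above can be computed from the confusion matrix as in the following sketch. The function name and the example labels are illustrative, and returning 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = case, 0 = control)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): MCC is set to 0 when a row or column is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical test set: three cases followed by three controls.
y_true = [1, 1, 1, 0, 0, 0]
perfect = mcc(y_true, [1, 1, 1, 0, 0, 0])       # every sample correct
one_missed = mcc(y_true, [1, 1, 0, 0, 0, 0])    # one case predicted as control
```

An MCC of 1 therefore means that every case and every control in the testing set was classified correctly, while a single missed case in this six-sample example already lowers the score to about 0.71.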

Thus, per syndrome, we extracted the best-performing model and the three best normalization methods for both DMC-esig and DMR-esig. For Sotos syndrome, PLR was the best-performing model for both DMR-esig and DMC-esig. For Kabuki syndrome, PLR and SVM were the best-performing models for DMC-esig and DMR-esig, respectively. Illumina, Noob, and Funnorm were chosen as the best normalization methods in all configurations according to the MCC computed (Fig. 2). We then compared the performance of those 6 models using testing datasets with an increasing number of missing CpGs (0, 5, 10, 20, and 30%). We showed that the DMR-based models started to outperform the DMC models at 30% of missing CpGs for both Kabuki and Sotos datasets. The higher robustness of the Sotos DMC-esig is likely due to the fact that it is composed of a large number of probes (> 7000) and that the effect size between cases and controls is usually much larger for Sotos syndrome (> 30%) than for Kabuki syndrome (Table 1).
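The missing-data scenario can be emulated by masking a random fraction of probes in a test sample and then filling the gaps before classification. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the probe identifiers and beta-values are made up, and imputing each missing probe with its mean in the training set is one simple choice, not necessarily the imputation used in this study.

```python
import random

def mask_probes(betas, fraction, rng):
    """Return a copy of `betas` in which `fraction` of the probes are set to None."""
    probes = sorted(betas)
    n_drop = round(len(probes) * fraction)
    dropped = set(rng.sample(probes, n_drop))
    return {p: (None if p in dropped else b) for p, b in betas.items()}

def impute_with_training_mean(sample, training_means):
    """Replace each missing beta-value by the probe's mean in the training set.

    Mean imputation is an assumption made for this sketch.
    """
    return {p: (training_means[p] if b is None else b) for p, b in sample.items()}

rng = random.Random(0)                                  # fixed seed for repeatability
sample = {f"cg{i:02d}": 0.5 for i in range(20)}         # hypothetical test sample
training_means = {p: 0.4 for p in sample}               # hypothetical training means

missing_counts = {}
for fraction in (0.0, 0.05, 0.10, 0.20, 0.30):
    masked = mask_probes(sample, fraction, rng)
    missing_counts[fraction] = sum(b is None for b in masked.values())
    imputed = impute_with_training_mean(masked, training_means)
    # `imputed` would then be passed to the trained classifier.
```

Repeating the masking with different seeds and averaging the resulting scores gives a mean and standard deviation per missingness level, which is the kind of summary reported in Table 1.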

Table 1 A represents the mean MCC (± Std) on Kabuki Testing Sets, while B represents the mean MCC (± Std) on Sotos testing sets at different levels of missing CpGs (0 to 30%), according to the best models and normalization methods previously selected (Fig. 2). The scenario of 0% of probes removal refers to the imputation of missing probes due to low-quality signal or new array version. This highlights the higher robustness of the DMR models to missing data compared to the DMC models


Specificity of Sotos^DMR and Kabuki^DMR episignatures

In order to verify the specificity of our models trained on DMR-esig for Sotos and Kabuki, we tested the best Sotos predictive model (PLR) on the Kabuki cohort (training set + testing sets) and the best Kabuki predictive model (SVM) on the Sotos cohort. All the samples in the Kabuki dataset were predicted as controls by the best Sotos predictive model and all the samples in the Sotos dataset were predicted as controls by the best Kabuki predictive model, regardless of the normalization method adopted, thus showing high inter-syndrome specificity (Fig. 3).

Finally, to further assess the specificity of the DMR-esig, the best predictive models for Sotos and Kabuki syndrome were tested on three separate cohorts of control samples, obtained from IM450K, IMEPIC and IMEPICv2, to assess any influence from the technology. The Sotos^PLR achieved a specificity of 100% in the IM450K, IMEPIC and IMEPICv2 samples for all three best normalization methods. In contrast, the best Kabuki predictive model (Kabuki^SVM) achieved 100% specificity only when using the model trained on data normalized with the Illumina normalization method (Fig. 4). However, it is important to highlight that the other two normalization methods achieved 100% specificity when considering only the 450 K and EPICv2 cohorts. These two cohorts of samples were indeed normalized using Illumina normalization as described in “Materials and methods”, while the cohort of EPICv1 samples was normalized using the dasen method from wateRmelon. This different preprocessing could explain why the EPIC prediction probability distribution is more spread out and shifted toward 1. This is consistent with the previous section (Fig. 2) where, looking at the heatmap for Kabuki syndrome, it was clear that the performance of the best predictive model (SVM) is highly dependent on the normalization method adopted. In contrast, the best Sotos predictive model was PLR, which is much more stable and robust to the normalization method adopted. That could partially explain the presence of false positives predicted by the best Kabuki predictive model. In addition, we applied our models to a cohort of unresolved cases previously described (see Methods). As these samples were normalized using the BMIQ method, the same considerations described above apply and some false positives can be expected. 
Nevertheless, two samples (GSM2366406, GSM2366710) were consistently predicted as positives by Kabuki^SVM with a prediction probability of 0.78, 0.92, 0.88 and 0.86, 0.80, 0.82 for Funnorm, Illumina, and Noob methods, respectively. On the other hand, one sample (GSM2366674) was consistently predicted as positive by the best Sotos predictive model with a prediction probability of 0.87, 0.96, and 0.87 when using Funnorm, Illumina, and Noob functions, respectively (Fig. 4A). Interestingly, those 3 patients showed a phenotype that was coherent with the predictions of our models (see Supplementary Data 1 in Barbosa et al. (2018)). Both GSM2366406 and GSM2366710 presented facial dysmorphism and congenital heart defects. In addition, GSM2366406 showed moderate intellectual disability that was not described for GSM2366710. The predicted Sotos sample (GSM2366674) was a newborn (age = 0) of short stature at the time of the study and suffered from hydrocephalus, heterotaxy and inguinal hernia.
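Because the control cohorts contain no cases, specificity on them reduces to the fraction of samples whose prediction probability stays below the decision threshold. A minimal sketch follows; the function name, the probabilities and the threshold of 0.5 are assumptions made for illustration and are not taken from this study.

```python
def specificity(control_probabilities, threshold=0.5):
    """Fraction of control samples predicted as controls.

    control_probabilities: predicted probability of being a case, one per control.
    threshold: decision cut-off; 0.5 is an assumption for this sketch.
    """
    true_negatives = sum(p < threshold for p in control_probabilities)
    return true_negatives / len(control_probabilities)

# Hypothetical prediction probabilities for four control samples.
example_specificity = specificity([0.10, 0.20, 0.60, 0.05])
```

A probability distribution shifted toward 1, as described above for the EPIC cohort, pushes more controls over the threshold and therefore lowers this value.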

Functional analysis of DMRs-episignature and interpretability of the models

A functional analysis was performed to elucidate the pathophysiology of Kabuki and Sotos Syndromes. The genomic locations of the DMR-esigs for Illumina, Noob, and Funnorm normalization methods show that the DMRs are located mostly in the gene bodies and promoters. Furthermore, the genomic partition analysis shows an enrichment of CpGs overlapping gene promoters and regions 1–5 kb upstream the TSS compared to the training sets used for both the Kabuki and Sotos datasets (Fig. 5 green bars). This result is consistent with previous published studies (Choufani et al. 2015; Butcher et al. 2017).

DAVID (Huang et al. 2009) was used to identify the biological processes most enriched with respect to the Kabuki and Sotos DMR-esig. We looked at the common genes between the three normalization methods for each syndrome and used them for the enrichment analysis. For the Kabuki dataset, the results showed an enrichment in genes implicated in the embryonic organ and skeletal development and morphogenesis (Additional File 3). Concerning Sotos dataset, the results highlighted an enrichment in nervous system development, system development, and developmental process, as well as cell–cell adhesion (Additional File 4), which is coherent with Sotos syndrome. These results confirmed the results already obtained in previous studies using DMC-esig instead of DMR-esig (Choufani et al. 2015; Butcher et al. 2017). Furthermore, the list of genes used for this analysis is shown in Additional File 5. Some of the HOXA cluster genes were highly relevant in the enrichment analysis of Kabuki syndrome and generally highly present in the genes overlapping the DMRs. This finding confirms previous studies (Butcher et al. 2017). On the other hand, some genes of the PCDHGA subfamily were found to overlap Sotos^DMR episignatures. This subfamily of genes encodes for large transmembrane proteins that are critical for processes such as cell-signaling (Peek et al. 2017). Furthermore, in a previous study, the researchers hypothesized that misregulation of this subfamily of proteins can contribute to the neurodevelopmental phenotype of Sotos syndrome (Bondhus et al. 2022). Taken together, these findings confirm the importance of the interpretation of episignatures and underline the capability of DMR-esigs to provide the same insights as DMC-esigs, with a higher and broader view of the mechanisms that are involved in the etiology of a disease.

Finally, the coefficients of Sotos^PLR_Noob, Sotos^PLR_Funnorm and Sotos^PLR_Illumina were obtained in order to assess the importance of the DMR predictors in the model classification. This was possible for Sotos syndrome since PLR is an interpretable model. For Kabuki syndrome, the SVM with a polynomial kernel does not return interpretable coefficients. Feature importance is reported in Additional File 6 for each of the three best normalization methods. In both Sotos^PLR_Funnorm and Sotos^PLR_Noob, the two most important DMR predictors, ranked by the absolute value of their coefficients, were the same and matched the AIRE and ZNF833P genes. For Sotos^PLR_Illumina, the two most important DMRs, again ranked by the absolute value of the coefficients, were DMR_9 and DMR_104, which match the AIRE and TOPAZ1 genes, respectively. No substantial evidence linking these genes to Sotos syndrome has been found in the literature.
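Ranking predictors by the absolute value of their coefficients, as done above for the PLR models, amounts to the following sketch. The function name, the DMR labels and the coefficient values are hypothetical and are not those reported in Additional File 6.

```python
def rank_predictors(coefficients, top=2):
    """Return the `top` predictors with the largest absolute coefficient.

    coefficients: dict predictor name -> fitted logistic regression coefficient.
    """
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top]

# Hypothetical coefficients of a fitted penalized logistic regression.
coefficients = {"DMR_a": -1.8, "DMR_b": 0.4, "DMR_c": 1.2, "DMR_d": 0.0}
top_two = rank_predictors(coefficients)
```

The sign is ignored for the ranking but is still informative: with cases coded as 1, a negative coefficient means that lower methylation in that region pushes the prediction toward the case class. Such a comparison of magnitudes is meaningful only if the predictors are on a comparable scale, which holds here because all features are median beta-values between 0 and 1.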

Discussion

Episignatures are powerful tools to enable patient diagnosis and classification of variants of unknown significance (VUS). However, due to their relative novelty, no framework on how to build them has been described. Nevertheless, efforts from one group led to the first guidelines on how to properly design cohorts for episignature discovery (Chater-Diehl et al. 2021). In this work, we build on this and previous work and propose a thorough analysis of how the normalization method, classification model, and missing data may affect episignature performance. We chose to evaluate the methods available in the minfi package, as it is the most popular tool to preprocess DNAm data from Infinium arrays. Indeed, several normalization methods are implemented in minfi, including the one from the equally popular Illumina GenomeStudio software. We used two episignatures for Kabuki and Sotos syndromes implemented in Epigencentral (Turinsky et al. 2020) as the training datasets were publicly available. The datasets consist of peripheral blood samples, which have exhibited a strong correlation in terms of DNAm profiling with brain regions in individuals affected by NDDs (Gregory et al. 2009; Murphy 2005). In fact, since the underlying events leading to most NDDs occur early in embryonic development, DNAm patterns associated with these syndromes have been detected in various tissues of affected individuals, including peripheral blood (Aref-Eshghi et al. 2020). In addition, given the practical constraints of obtaining brain tissue from living individuals, peripheral blood serves as a valuable, accessible, and minimally invasive alternative for diagnostic purposes (Aref-Eshghi 2020). However, peripheral blood samples are also characterized by a high level of noise that can bias the results. For this reason, the entire preprocessing procedure of the samples is important and must be performed carefully in order to reduce such noise. 
In addition, we designed a new strategy for episignature discovery based on DMR identification. This strategy was motivated by the observation that episignatures are usually based on methylation profiles assessed by microarray technologies. Methylation arrays are subject to poor probe hybridization and new version releases that drop or add CpG loci, which may lead to missing values. In classification tasks, the issue of missing data is usually resolved by imputation. However, this may cause noise or error in the models. This is particularly the case in a biological context, where imputation should only be performed under a strong hypothesis. We observed that with the new iterations of the Infinium arrays (IM450K; IMEPIC; IMEPICv2) there were already missing CpGs in both Kabuki and Sotos published episignatures (i.e., from IM450K to IMEPICv2: 87% concordance for Kabuki, 91% for Sotos). The new paradigm of DMR-based episignatures that we propose (see Methods) tackles this issue by considering the median methylation of the CpGs within a region, therefore decreasing the reliance on individual loci. Unsupervised hierarchical clustering is a classical way to assess the capacity of a signature to differentiate between cases and controls. Using this method, we saw that both types of episignatures (DMR-esig and DMC-esig), for all six normalization methods, were able to correctly cluster the Sotos and the Kabuki training sets. In addition, we showed that the DMR-esig retained the ability to further refine the clustering of patients showing a different profile within a syndrome. We then explored the performance of the three most commonly used classification models. SVMs are usually the model of choice for episignature classification tasks (Choufani et al. 2015; Butcher et al. 2017; Aref-Eshghi et al. 2020; Turinsky et al. 2020; Verberne et al. 2022; Coenen-van der Spek et al. 2023). 
However, there is no existing work on how other models perform in comparison. Therefore, we chose to implement SVM, RF, and PLR models and evaluated their performance using the MCC. We showed that for both DMC-esigs, PLR was the only classifier reaching an MCC of 1 in every setting, whereas the performance of the SVM and RF classifiers depended on the method used to normalize data. Using DMR-esig, PLR showed the overall best performance on the Sotos testing cohort, while SVM showed the best performance on the Kabuki testing cohort. These results may be due to the fact that those two syndromes differ in the effect size in methylation levels between patients and controls. PLR may perform better at discriminating large effects, whereas SVM may perform better with weaker differential methylation. We also highlighted that model performance on training sets normalized with the Raw, Swan, or Quantile methods was almost always worse than with Illumina, Noob, or Funnorm. Therefore, we would advise against the use of those methods during data preprocessing. We then analyzed the influence of missing data on model performance. As discussed above, missing data may represent an increasing challenge when episignatures are developed on deprecated technologies. We assessed the performance of DMC-esig and DMR-esig in the testing sets with increasing amounts of missing data (5, 10, 20 and 30% of missing probes). Our results indicated that both DMC-esig and DMR-esig are valid options in a classification task in the context of rare disorders affecting DNAm. However, with an increasing percentage of missing probes in the testing set, the models based on DMR-esig slightly outperformed the ones based on DMC-esig in terms of mean MCC. These results highlight that DMR-based episignatures are more robust to missing data. Therefore, we suggest that readers consider this approach when building episignatures. 
In addition, DMR-based episignatures are less likely to become deprecated as new technologies to measure DNA methylation become available (e.g., reduced representation bisulfite sequencing (RRBS), enzymatic methyl-seq (EM-seq), Oxford Nanopore Technologies). Nevertheless, we want to draw attention to the fact that other methods for DMR detection exist (Mallik 2019; Zheng et al. 2022). Those methods have been shown to outperform bumphunter and could be valid alternatives in the implementation of DMR-esigs. In the two syndromes presented here, we saw that using DMRcate (Peters et al. 2015), the best-performing software in Mallik (2019), to identify DMRs did not lead to a significant change in performance. It slightly increased the MCC on the testing sets when using Illumina, Noob and Funnorm normalization methods (Additional Fig. 6), slightly decreased it when using Raw, Swan and Quantile normalization methods (Additional Fig. 6) and slightly decreased the specificity on the GSE40279 control cohort (0.98, 0.98 and 0.99 when using Kabuki^SVM_Illumina, Kabuki^SVM_Noob and Kabuki^SVM_Funnorm, respectively, while it remained equal to 1 when using the Sotos^PLR_Illumina, Sotos^PLR_Noob and Sotos^PLR_Funnorm classifiers). This could be due to the fact that the effect size in those two syndromes was close to the threshold at which bumphunter performs best (Mallik 2019; Zheng et al. 2022). In addition, we saw that bumphunter-based DMR-esig classifiers performed well. However, we would advise comparing the performance of those different methods when implementing a DMR-esig in other disorders.

Conclusions

This work highlights three key points that are important to consider when building a new episignature. First, the choice of the best model is highly dependent on the dataset, and particularly on the type of disease and features. For this reason, it is better to provide a single model that, during the building procedure, was shown to be the most accurate and robust to missing data and normalization methods. In addition, in the context of clinical diagnosis, it is important to consider the interpretability of the model, and it is always better to choose an interpretable model (e.g., linear SVM, PLR) when possible. Indeed, access to feature importance makes it possible to better characterize the underlying biological mechanisms of the episignature. Second, choosing a normalization method is necessary and must be done carefully when training a model. Overall, from our analysis, it is preferable to use Funnorm, Noob or Illumina as normalization functions and avoid the use of Raw, Quantile, and Swan. The first three have been shown to maximize the accuracy independently of the type of model or disease. Finally, the use of DMRs instead of DMCs as features to train a classification model has been demonstrated to be a valid, accurate, and robust alternative to currently adopted episignatures. This new approach does not require an imputation algorithm to deal with missing data. Nor does it require, when a new Illumina array version is released, rebuilding a model trained only on the CpGs common to the old and the new array versions.

Materials and methods

Subjects and cohort

The study includes peripheral blood from 26 samples collected at the ULB Erasme Hospital Center for Human Genetics (Brussels, Belgium). Informed consent was obtained from all research participants according to the protocol approved by the Ethics committee of the Erasme Hospital. The samples comprise 22 controls, 1 patient with a confirmed diagnosis of Kabuki syndrome (OMIM: #147920) and 3 patients with a confirmed diagnosis of Sotos syndrome (OMIM: #117550) through the identification of class IV or V variants by genetic testing. Control specimens were from healthy individuals without any developmental delay, intellectual disability, or congenital anomalies. Additional peripheral blood datasets were retrieved from the Gene Expression Omnibus (GEO). They include data used to generate the Kabuki (IM450K, GSE97362) and Sotos (IM450K, GSE74432) episignatures (Choufani et al. 2015; Butcher et al. 2017), a cohort of 489 unresolved cases (IM450K, GSE89353) (Barbosa et al. 2018), 521 control samples (IMEPIC, GSE152026) (Noguera-Castells et al. 2023), 656 control samples (IM450K, GSE40279) and 4 additional control samples (IMEPICv2, GSE222919) (Teschendorff et al. 2013).

DNA methylation experiment

Whole-blood DNA samples were bisulfite converted using the EZ DNA Methylation™ Kit from Zymo Research according to the manufacturer's standard procedure. The sodium bisulfite converted DNA was then hybridized to the Illumina Infinium MethylationEPIC v1.0 BeadChip to interrogate over 850,000 CpG sites in the human genome. The arrays were scanned by the BRIGHTcore facility of UZ Brussels.

Training and testing sets

Training and testing sets for the Kabuki and Sotos episignatures were derived from datasets GSE97362 and GSE74432, respectively. Additional Files 7 and 8 contain the GEO ID of the training sets for the Kabuki and Sotos cohort, respectively. The training sets were balanced with the same number of control and case samples (Table 2). Two additional testing sets composed of IMEPIC samples generated at the ULB Erasme Hospital Center for Human Genetics (Brussels, Belgium) were used to assess the model performances on external IMEPIC data (Table 2).

Table 2 Description of the three datasets used in this work. Two testing sets based on 450 K and EPIC samples have been used to test the performances of the classifiers. One training set based on 450 K samples has been used to train the models

The Sotos case training set is composed of 8 females and 11 males, with an average age of 11.67 ± 10.97 years (range 0–41 years), while the control group is composed of 12 females and 7 males with an average age at sample collection of 10.94 ± 5.45 years (range 3–18 years) (Additional File 8). The Kabuki case training group is composed of two females and nine males with an average age at sample collection of 13.64 ± 5.64 years (range 1–20 years), while the control group is composed of two females and nine males with an average age at sample collection of 10 ± 3.43 years (range 5.5–13 years) (Additional File 7). The testing cohort includes the rest of the respective datasets without the samples with variants of unknown significance (VUS), as well as the data from Erasme Hospital.

Data preprocessing

Data from the original articles and in-house data were processed using the minfi library in R (Aryee et al. 2014). Probes with low-quality signals (detection p value > 0.01), cross-reactive probes, probes harboring single nucleotide polymorphisms (SNPs), and probes on sex chromosomes were removed. The cross-reactive probes were removed using the R library maxprobes. The total number of probes retained for each dataset is shown in Table 3. Normalization of the data was performed using the six different methods available in minfi with default parameters: Noob (Triche et al. 2013), Illumina, Funnorm (Fortin et al. 2014), Raw (no normalization), Quantile (Touleimat and Tost 2012) and Swan (Maksimovic et al. 2012). The Illumina normalization method implemented in minfi is the same method used in Choufani et al. (2015) and Butcher et al. (2017) to normalize the beta-values. Normalization, like preprocessing, was performed separately on the training and testing sets in order to reduce bias and avoid overestimating the accuracy of the classifiers on the testing sets. Quality control (QC) of the samples was assessed using the minfi package (Aryee et al. 2014): no poor-quality samples were found, so all samples were kept for the analysis. We used the Illumina annotations based on the hg19 genome version in order to have the same, comparable positions and annotations for all array versions.
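The probe filters described above can be illustrated with a minimal Python sketch. This is not the minfi/maxprobes code used in the study: the function name filter_probes, the column name 'chr', the toy probe IDs and the rule that a probe fails when any sample exceeds the detection p value cutoff are all assumptions made for the example.

```python
import pandas as pd

def filter_probes(betas, det_p, annotation, p_cutoff=0.01, excluded_ids=()):
    """Keep only probes that pass every filter.

    betas, det_p: DataFrames indexed by probe ID, one column per sample.
    annotation:   DataFrame indexed by probe ID with a 'chr' column (assumed name).
    excluded_ids: probe IDs to drop, e.g. cross-reactive or SNP-harboring probes.
    """
    # Assumption: a probe is low quality if ANY sample exceeds the cutoff.
    low_quality = (det_p > p_cutoff).any(axis=1)
    # Probes without a detection p value are conservatively treated as failing.
    low_quality = low_quality.reindex(betas.index, fill_value=True)
    on_sex_chr = annotation["chr"].reindex(betas.index).isin(["chrX", "chrY"])
    blacklisted = betas.index.isin(list(excluded_ids))
    keep = ~(low_quality | on_sex_chr | blacklisted)
    return betas.loc[keep]

# Toy data with hypothetical probe IDs.
probes = ["cg01", "cg02", "cg03", "cg04"]
samples = ["s1", "s2"]
betas = pd.DataFrame([[0.1, 0.2], [0.8, 0.9], [0.5, 0.4], [0.3, 0.3]],
                     index=probes, columns=samples)
det_p = pd.DataFrame([[0.001, 0.002], [0.5, 0.001], [0.0, 0.0], [0.0, 0.0]],
                     index=probes, columns=samples)
annotation = pd.DataFrame({"chr": ["chr1", "chr2", "chrX", "chr3"]}, index=probes)

filtered = filter_probes(betas, det_p, annotation, excluded_ids=["cg04"])
```

In this toy example, cg02 fails the detection p value filter, cg03 lies on a sex chromosome and cg04 is on the exclusion list, so only cg01 is retained.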

Table 3 Number of probes remaining after preprocessing and filtering together with their percentages

Differentially methylated regions (DMRs) episignatures

To identify DMRs in the Kabuki and Sotos training sets, we used the ChAMP library available in R (Tian et al. 2017). Specifically, the DMR function was used with bumphunter as the DMR detection method (Jaffe et al. 2012). The following non-default parameters were used: number of bootstraps = 1000; minimum number of probes per DMR = 10. For both syndromes, this function was run on the training set six times, once per normalization method. The cutoff parameter was decided by the software, except for the Sotos dataset, where we applied a more stringent filter in order to reduce the number of DMRs found. Thus, the cutoff parameter was set to 0.7 for Sotos syndrome and kept at 0.292 for Kabuki syndrome. Only DMRs with an adjusted p value < 0.05 were retained. Once the DMRs were computed, the median methylation value of the probes included in each DMR was extracted and considered as the methylation value for that DMR. The methylation values of the DMRs were computed separately for the training and the testing sets.
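The step that turns a DMR into a single feature can be sketched as follows. This is a minimal illustration, not the study's code: the function name dmr_median_betas, the probe IDs and the DMR names are hypothetical, and the DMR-to-probe mapping is assumed to have already been obtained from the DMR detection step.

```python
import pandas as pd

def dmr_median_betas(betas, dmr_probes):
    """Summarize each DMR by the median beta-value of its probes.

    betas:      DataFrame indexed by probe ID, one column per sample.
    dmr_probes: dict mapping a DMR name to the list of probe IDs it contains.
    Returns a samples x DMRs DataFrame usable as a feature matrix.
    """
    features = {}
    for dmr, probes in dmr_probes.items():
        # Use only the probes actually present in this dataset.
        present = [p for p in probes if p in betas.index]
        # pandas skips NaN values when computing the median.
        features[dmr] = betas.loc[present].median(axis=0)
    return pd.DataFrame(features)

# Toy beta-values for two samples and two hypothetical DMRs.
betas = pd.DataFrame(
    {"s1": [0.1, 0.2, 0.9, 0.4, 0.6], "s2": [0.5, 0.7, 0.6, 0.8, 1.0]},
    index=["cg01", "cg02", "cg03", "cg04", "cg05"],
)
dmr_probes = {"DMR_1": ["cg01", "cg02", "cg03"], "DMR_2": ["cg04", "cg05"]}

features = dmr_median_betas(betas, dmr_probes)
```

Because the median is taken over whichever probes are available, the same function can be applied unchanged to the training and testing sets.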

Hierarchical clustering and UMAP

Hierarchical clustering was performed using the seaborn clustermap function, with average as the linkage method and cityblock (Manhattan) as the distance metric to cluster samples, which is preferred over Euclidean distance for high-dimensional applications (Aggarwal et al. 2001). The Python umap package was used to visualize the samples in a two-dimensional space, using beta-values from the DMC-esig or median beta-values from the DMRs in the DMR-esig, respectively. Data were scaled with the Python scikit-learn StandardScaler function prior to using UMAP.
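The clustering settings above can be illustrated on synthetic data. This sketch calls SciPy directly rather than seaborn, and it stops at the scaling step instead of running UMAP, so that it depends only on NumPy, SciPy and scikit-learn. The sample sizes, means and spread are arbitrary assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for an episignature matrix: 6 controls and 6 cases,
# 20 features each, with clearly different mean methylation.
controls = rng.normal(0.3, 0.02, size=(6, 20))
cases = rng.normal(0.7, 0.02, size=(6, 20))
X = np.clip(np.vstack([controls, cases]), 0.0, 1.0)

# Average linkage on cityblock (Manhattan) distances between samples.
Z = linkage(X, method="average", metric="cityblock")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters

# Standardize each feature before a dimensionality reduction step.
scaled = StandardScaler().fit_transform(X)
```

With seaborn, the equivalent call would be clustermap(data, method="average", metric="cityblock"), and the scaled matrix could then be passed to umap.UMAP(n_components=2).fit_transform(scaled) if the umap-learn package is installed.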

Machine learning classifiers for discriminating between syndrome and non-syndrome samples

Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression regularized with a penalization term (PLR) were trained either on the beta-values of the DMC-esig, or on the median beta-values of the DMRs constituting the DMR-esig. Moreover, for each syndrome, the models were trained on each of the six normalized training sets as described above. For PLR, the solver parameter was set to liblinear, and the default L2 penalty was kept with a dual formulation. Candidate hyperparameters were specified in a grid search and the optimal ones were chosen through leave-one-out cross-validation (LOOCV) (Additional File 9). The classifiers were then tested on the test sets to assess their performances on new, unknown samples. The prediction probabilities were also extracted from the results. The Matthews correlation coefficient (MCC) was used to evaluate the models' performances. Indeed, MCC is preferred to the F1 score in binary classification tasks due to its higher robustness to imbalanced datasets (Chicco and Jurman 2020). Model training, testing, and performance assessment were done using scikit-learn. The procedure adopted to build classification models for both Sotos and Kabuki syndrome is summarized in Fig. 6.
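The training procedure for the PLR model can be sketched with scikit-learn as follows. The data are synthetic, the grid of C values is an arbitrary assumption (the grids actually used are given in Additional File 9), and accuracy is assumed as the cross-validation score because MCC is not informative on folds that contain a single sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import GridSearchCV, LeaveOneOut

rng = np.random.default_rng(42)

def make_set(n_per_class, n_features=10):
    """Synthetic beta-like values: controls near 0.3, cases near 0.6."""
    controls = rng.normal(0.3, 0.05, size=(n_per_class, n_features))
    cases = rng.normal(0.6, 0.05, size=(n_per_class, n_features))
    X = np.vstack([controls, cases])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_train, y_train = make_set(11)  # balanced training set
X_test, y_test = make_set(8)     # independent testing set

# liblinear solver with the default L2 penalty in its dual formulation.
plr = LogisticRegression(solver="liblinear", dual=True, max_iter=1000)

# Leave-one-out cross-validation over a hypothetical grid of C values.
search = GridSearchCV(plr, {"C": [0.01, 0.1, 1, 10]},
                      cv=LeaveOneOut(), scoring="accuracy")
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
proba = search.predict_proba(X_test)[:, 1]  # probability of the case class
mcc = matthews_corrcoef(y_test, y_pred)
```

The same pattern applies to the SVM and RF models by swapping the estimator and its parameter grid.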

Testing robustness for missing probes

To avoid missing values in the DMC-esig, NaN values resulting from CpG removal (due to a new array version and/or low-quality signal probes) were imputed using the pandas fillna function (method = 'ffill'). In order to test the robustness of the different classifiers, we also removed 5, 10, 20 or 30% of CpGs from the whole array, after normalization, in all the testing datasets. We then imputed the missing probes with the same procedure as above. This process simulates a scenario in which the release of a new technology and the removal of low-quality signal probes introduce missing values in the samples to be classified with existing models. In the DMR-esig, no imputation was performed, and for each DMR the median methylation value was computed on the remaining probes within that DMR. Then, prediction performances and probabilities for each model were extracted in order to evaluate the robustness of both approaches. We repeated this process ten times, using different random seeds to select the CpGs to remove, and computed the mean and the standard deviation of the MCC from all individual performance evaluations.
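The two ways of handling missing probes can be contrasted in a short sketch. The data are random, the helper drop_probes and the probe IDs are hypothetical, and the beta matrix is assumed to be oriented as probes by samples, so that forward filling copies the value of the preceding probe in the same sample. DataFrame.ffill() is used here as the equivalent of fillna(method='ffill') in recent pandas versions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
probes = [f"cg{i:03d}" for i in range(100)]
betas = pd.DataFrame(rng.uniform(0, 1, size=(100, 4)),
                     index=probes, columns=["s1", "s2", "s3", "s4"])

def drop_probes(betas, fraction, seed):
    """Set a random fraction of probes (rows) to NaN in every sample."""
    local_rng = np.random.default_rng(seed)
    n_drop = int(round(fraction * len(betas)))
    dropped = local_rng.choice(betas.index.to_numpy(), size=n_drop, replace=False)
    out = betas.copy()
    out.loc[dropped] = np.nan
    return out

masked = drop_probes(betas, 0.20, seed=1)

# DMC-esig route: forward-fill along the probe axis. A missing probe with
# no preceding observed probe (at the top of the table) stays NaN.
imputed = masked.ffill()

# DMR-esig route: no imputation; the median of a hypothetical 10-probe DMR
# is taken over whichever of its probes remain (NaN values are skipped).
dmr_probes = probes[10:20]
dmr_value = masked.loc[dmr_probes].median(axis=0)
```

Repeating drop_probes with ten different seeds and re-scoring the classifiers each time would give the mean and standard deviation of the MCC described above.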

Validation of the DMR-esig models on external datasets

To assess the specificity of the models trained on DMR-esig across all array versions, a cohort of 521 IMEPIC control samples (GSE152026), 656 IM450K control samples (GSE40279) and 4 additional IMEPICv2 control samples (GSE222919) were used. Furthermore, the models were applied to a cohort of 489 IM450K unresolved cases (GSE89353). Data from GSE89353, GSE222919, GSE152026, and GSE40279 were retrieved already preprocessed and normalized using BMIQ (Quantile), Illumina GenomeStudio, the wateRmelon dasen function and Illumina GenomeStudio, respectively, as described in the original studies (Barbosa et al. 2018; Noguera-Castells et al. 2023; Teschendorff et al. 2013; Hannum et al. 2013). Beta-values were used as such without any further processing.
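On a cohort made only of controls, specificity reduces to the fraction of samples predicted as non-syndrome. A minimal sketch, in which the label coding (0 for non-syndrome) and the predictions are purely illustrative and are not results from the study:

```python
import numpy as np

def specificity_on_controls(predictions):
    """Fraction of control samples predicted as non-syndrome (label 0)."""
    predictions = np.asarray(predictions)
    if predictions.size == 0:
        raise ValueError("no predictions supplied")
    return float(np.mean(predictions == 0))

# Hypothetical predictions for 50 controls: 49 true negatives, 1 false positive.
control_predictions = np.array([0] * 49 + [1])
spec = specificity_on_controls(control_predictions)
```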

Functional analysis

All analyses were performed using R (version 4.1.2). For both the Kabuki and Sotos DMR-esigs built with the Illumina, Noob and Funnorm normalization methods, we extracted the CpGs belonging to the DMRs of each episignature. Then, the CpGs were annotated to the genome assembly GRCh37 (hg19) using annotatr (v. 1.24.0). Genic annotations included regions 1–5 kb upstream of the transcription start site (TSS), gene promoters (< 1 kb upstream of the TSS), intergenic regions, 5’UTRs and 3’UTRs, exons, introns and intron–exon boundaries, using the following built-in annotations: ‘hg19_basicgenes’, ‘hg19_genes_intergenic’ and ‘hg19_genes_intronexonboundaries’. The UTRs, exons, introns and intron–exon boundaries were considered part of the “gene body” category. The genomic annotations of the episignature CpGs were then compared to those of all CpGs in the corresponding training set in order to understand which regions are most enriched in the DMR-esigs.
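One simple way to make such a comparison is to compute, for each annotation category, the ratio between its share among episignature CpGs and its share among all CpGs in the training set. The sketch below assumes this fold-enrichment ratio as the comparison; the text does not specify the exact statistic, and the category labels and counts are invented for illustration.

```python
from collections import Counter

def category_fractions(categories):
    """Share of CpGs falling in each annotation category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def fold_enrichment(signature, background):
    """Ratio of signature share to background share, per background category."""
    sig = category_fractions(signature)
    bg = category_fractions(background)
    return {cat: sig.get(cat, 0.0) / frac for cat, frac in bg.items()}

# Invented annotations: one category label per CpG.
signature = ["promoter"] * 6 + ["gene body"] * 3 + ["intergenic"] * 1
background = ["promoter"] * 20 + ["gene body"] * 50 + ["intergenic"] * 30

fold = fold_enrichment(signature, background)
```

A ratio above 1 indicates a category that is over-represented in the episignature relative to the array background.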

Gene ontology enrichment analysis was performed using DAVID (Huang et al. 2009) to identify the most enriched biological processes. Only genes shared by the DMR^Illumina, DMR^Noob and DMR^Funnorm episignatures were used for the enrichment analysis. Enriched terms were filtered at adjusted P value < 0.05 (Benjamini–Hochberg procedure).

Software versions

Analyses were performed either in Python (version: 3.8.10) or R (version 4.1.2).

The following versions of the packages were used: minfi (version 1.40.0); maxprobes (version 0.0.2); ChAMP (version 2.24.0); seaborn (version 0.12.2); umap (version 0.5.3); scikit-learn (version 1.2.2); pandas (version 1.5.3).

Availability of data and materials

Public datasets used in the study are described in the Methods section. Data generated in Erasme Hospital can be obtained from the corresponding author upon a reasonable request.

Abbreviations

DNAm: DNA methylation
DMC: Differentially methylated cytosine
DMC-esig: DMC-based episignature
DMR: Differentially methylated region
DMR-esig: DMR-based episignature
EWAS: Epigenome-wide association study
IM450K: Infinium HumanMethylation 450 K BeadChip
IMEPIC: Infinium HumanMethylation EPIC BeadChip
IMEPICv2: Infinium HumanMethylation EPICv2 BeadChip
LOOCV: Leave-one-out cross-validation
MCC: Matthews correlation coefficient
NDD: Neurodevelopmental disorder
PLR: Penalized logistic regression
QC: Quality control
RF: Random forest
SNP: Single nucleotide polymorphism
SVM: Support vector machine
VUS: Variant of unknown significance

References

Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional space. In: Van den Bussche J, Vianu V (eds) Database theory—ICDT 2001. Springer, New York, pp 420–434. https://doi.org/10.1007/3-540-44503-X_27
Aref-Eshghi E et al (2020) Evaluation of DNA methylation episignatures for diagnosis and phenotype correlations in 42 mendelian neurodevelopmental disorders. Am J Hum Genet 106:356–370
Aryee MJ et al (2014) Minfi: a flexible and comprehensive bioconductor package for the analysis of infinium DNA methylation microarrays. Bioinformatics 30:1363–1369
Barbosa M et al (2018) Identification of rare de novo epigenetic variations in congenital disorders. Nat Commun 9:2064
Bizet M et al (2022) Improving Infinium MethylationEPIC data processing: re-annotation of enhancers and long noncoding RNA genes and benchmarking of normalization methods. Epigenetics. 17:2434–2454
Bondhus L, Wei A, Arboleda VA (2022) DMRscaler: a scale-aware method to identify regions of differential DNA methylation spanning basepair to multi-megabase features. BMC Bioinformatics 23:364
Butcher DT et al (2017) CHARGE and Kabuki syndromes: gene-specific DNA methylation signatures identify epigenetic mechanisms linking these clinically overlapping conditions. Am J Hum Genet 100:773–788
Chater-Diehl E et al (2021) Anatomy of DNA methylation signatures: emerging insights and applications. Am J Hum Genet 108:1359–1366
Cheung K, Burgers MJ, Young DA, Cockell S, Reynard LN (2020) Correlation of infinium HumanMethylation450K and MethylationEPIC BeadChip arrays in cartilage. Epigenetics 15:594–603
Chicco D, Jurman G (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21:6
Choufani S et al (2015) NSD1 mutations generate a genome-wide DNA methylation signature. Nat Commun 6:10207
Dedeurwaerder S et al (2011) Evaluation of the infinium methylation 450K technology. Epigenomics 3:771–784
Fortin J-P et al (2014) Functional normalization of 450k methylation array data improves replication in large cancer studies. Genome Biol 15:503
Fortin J-P, Triche TJ, Hansen KD (2017) Preprocessing, normalization and integration of the illumina HumanMethylationEPIC array with minfi. Bioinformatics 33:558–560
Fernandez-Jimenez N et al (2019) Comparison of Illumina 450K and EPIC arrays in placental DNA methylation. Epigenetics 14(12):1177–1182. https://doi.org/10.1080/15592294.2019.1634975
Grafodatskaya D et al (2013) Multilocus loss of DNA methylation in individuals with mutations in the histone H3 Lysine 4 Demethylase KDM5C. BMC Med Genomics 6:1
Gregory SG, Connelly JJ, Towers AJ, Johnson J, Biscocho D, Markunas CA, Lintas C, Abramson RK, Wright HH, Ellis P, Langford CF, Worley G, Delong GR, Murphy SK, Cuccaro ML, Persico A, Pericak-Vance MA (2009) Genomic and epigenetic evidence for oxytocin receptor deficiency in autism. BMC Med 22(7):62. https://doi.org/10.1186/1741-7015-7-62
Grolaux R et al (2022) Identification of differentially methylated regions in rare diseases from a single-patient perspective. Clin Epigenetics 14:174
Hannon E et al (2021) DNA methylation meta-analysis reveals cellular alterations in psychosis and markers of treatment-resistant schizophrenia. Elife 10:e58430
Hannum G et al (2013) Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol Cell 49:359–367
Horsthemke B (2006) Epimutations in human disease. In: Doerfler W, Böhm P (eds) DNA methylation: development, genetic disease and cancer. Springer, New York, pp 45–59. https://doi.org/10.1007/3-540-31181-5_4
Huang DW, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4:44–57
Jaffe AE et al (2012) Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol 41:200–209
Kagami M et al (2021) ZNF445: a homozygous truncating variant in a patient with Temple syndrome and multilocus imprinting disturbance. Clin Epigenetics 13:119
Levy MA et al (2022) Novel diagnostic DNA methylation episignatures expand and refine the epigenetic landscapes of Mendelian disorders. HGG Adv 3:100075
Maksimovic J, Gordon L, Oshlack A (2012) SWAN: Subset-quantile within array normalization for illumina infinium HumanMethylation450 BeadChips. Genome Biol 13:R44
Mallik S, Odom GJ, Gao Z, Gomez L, Chen X, Wang L (2019) An evaluation of supervised methods for identifying differentially methylated regions in illumina methylation arrays. Brief Bioinf. 20(6):2224–2235. https://doi.org/10.1093/bib/bby085
Malouf GG et al (2016) DNA methylation signature reveals cell ontogeny of renal cell carcinomas. Clin Cancer Res 22:6236–6246
Mc Auley MT (2021) DNA methylation in genes associated with the evolution of ageing and disease: a critical review. Ageing Res Rev 72:101488
Moran S, Arribas C, Esteller M (2016) Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics 8:389–399
Murphy BC, O’Reilly RL, Singh SM (2005) Site-specific cytosine methylation in S-COMT promoter in 31 brain regions with implications for studies involving schizophrenia. Am J Med Genet B Neuropsychiatr Genet 133B(1):37–42. https://doi.org/10.1002/ajmg.b.30134
Noguera-Castells A, García-Prieto CA, Álvarez-Errico D, Esteller M (2023) Validation of the new EPIC DNA methylation microarray (900K EPIC v2) for high-throughput profiling of the human DNA methylome. Epigenetics. 18:2185742
Peek SL, Mah KM, Weiner JA (2017) Regulation of neural circuit formation by protocadherins. Cell Mol Life Sci 74:4133–4157
Peters TJ, Buckley MJ, Statham AL et al (2015) De novo identification of differentially methylated regions in the human genome. Epigenet Chromatin 8:6. https://doi.org/10.1186/1756-8935-8-6
Reichard J, Zimmer-Bensch G (2021) The epigenome in neurodevelopmental disorders. Front Neurosci. https://doi.org/10.3389/fnins.2021.776809
Teschendorff AE et al (2013) A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics 29:189–196
Tian Y et al (2017) ChAMP: updated methylation analysis pipeline for illumina BeadChips. Bioinformatics 33:3982–3984
Touleimat N, Tost J (2012) Complete pipeline for infinium® human methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics 4:325–341
Triche TJ, Weisenberger DJ, Van Den Berg D, Laird PW, Siegmund KD (2013) Low-level processing of Illumina Infinium DNA methylation beadarrays. Nucleic Acids Res 41:e90
Turinsky AL et al (2020) EpigenCentral: portal for DNA methylation data analysis and classification in rare diseases. Hum Mutat 41:1722–1733
van der Coenen Spek J et al (2023) DNA methylation episignature for Witteveen-Kolk syndrome due to SIN3A haploinsufficiency. Genet Med. 25:63–75
Verberne EA et al (2022) DNA methylation signature for JARID2-neurodevelopmental syndrome. Int J Mol Sci 23:8001
Zheng Y, Lunetta KL, Liu C, Katrinli S, Smith AK, Miller MW, Logue MW (2022) An evaluation of the genome-wide false positive rates of common methods for identifying differentially methylated regions using illumina methylation arrays. Epigenetics. 17(13):2241–2258. https://doi.org/10.1080/15592294.2022.2115600

Acknowledgements

We thank the family members who provided samples and clinical information for this study.

Funding

Part of this research was financed by Innoviris in the context of the BRIDGE IGencare project 2017 PFS-11e IgenCare (ULB) and BRGIMP12 (VUB). RG is a PhD fellow of the Fonds pour la Formation à la Recherche dans l’Industrie et l’Agronomie (F.N.R.S-FRIA, Belgium) (1.E.013.20F). RG is a beneficiary of the Fonds David et Alice Van Buuren, Fondation Jaumotte-Demoulin and the Fondation Héger-Masson.

Author information

Edoardo Giuili and Robin Grolaux contributed equally to this work.

Authors and Affiliations

Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Brussels, Belgium
Edoardo Giuili, Robin Grolaux, Catarina Z. N. M. Macedo, Catharina Olsen, Sonia Van Dooren, Guillaume Smits & Matthieu Defrance
Center for Human Genetics, Hôpital Erasme, Hôpital Universitaire de Bruxelles, Université Libre de Bruxelles, Brussels, Belgium
Laurence Desmyter, Bruno Pichon, Sebastian Neuens, Catheline Vilain & Guillaume Smits
Department of Genetics, Hôpital Universitaire Des Enfants Reine Fabiola, Hôpital Universitaire de Bruxelles, Université Libre de Bruxelles, Brussels, Belgium
Sebastian Neuens, Catheline Vilain & Guillaume Smits
Clinical Sciences, Research Group Reproduction and Genetics, Brussels Interuniversity Genomics High Throughput Core (BRIGHTcore), Vrije Universiteit Brussel (VUB), Universitair Ziekenhuis Brussel (UZ Brussel), Brussels, Belgium
Catharina Olsen & Sonia Van Dooren
Clinical Sciences, Research Group Reproduction and Genetics, Centre for Medical Genetics, Vrije Universiteit Brussel (VUB), Universitair Ziekenhuis Brussel (UZ Brussel), Brussels, Belgium
Catharina Olsen & Sonia Van Dooren

Contributions

Conceptualization: MD, EG, RG; patient data production and curation: LD, GS, SN, CV, BP; formal analysis: EG, RG, MD, CM; methodology: EG, RG, CM; writing-original draft: EG, RG, CM; writing/review and editing: EG, RG, MD, CO, GS. All the authors read and approved the final manuscript.

Corresponding author

Correspondence to Matthieu Defrance.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Ethical approval and consent to participate

The study includes peripheral blood from 26 samples collected at the Erasme Hospital Genetic Center (Brussels, Belgium). Informed consent was obtained from all research participants according to the protocols Q2015/003 and P2019/436 approved by the Ethics committee of the Erasme Hospital.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (ZIP 4431 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Giuili, E., Grolaux, R., Macedo, C.Z.N.M. et al. Comprehensive evaluation of the implementation of episignatures for diagnosis of neurodevelopmental disorders (NDDs). Hum. Genet. 142, 1721–1735 (2023). https://doi.org/10.1007/s00439-023-02609-2

Received: 28 July 2023
Accepted: 10 October 2023
Published: 27 October 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s00439-023-02609-2
