Background

Rare diseases affect approximately 400 million people worldwide. Several of them are characterized by extreme clinical and genetic heterogeneity that slows down the process of etiological diagnosis for individual patients. Over the past few years, the increasing digitalization of health-related data brings the opportunity to reuse clinical data to get new insights regarding the management of diseases and to reduce the risks of mis- and delayed diagnosis [1, 2]. One solution to accelerate the diagnosis process is to rely on patients’ electronic health records (EHRs) for automated phenotyping [3] and develop algorithms to match patients’ profiles based on their phenotypes. In other words, EHRs of already diagnosed patients can be reused to suggest an automated phenotype-driven diagnosis for undiagnosed cases [4]. We have shown that clinicians mainly use narrative documents to report on signs, symptoms and comorbidities [5, 6] and that it provided valuable information about the natural histories of rare diseases like Myrhe [7] and Dravet [8, 9] syndromes. Nowadays, several studies use phenotypic information that is present in text and most machine learning models use automated natural language processing (NLP) systems to extract clinically relevant information from EHRs [10], for example for pediatric [11] or kidney diseases [12, 13]. Noaeen et al. [14] confirmed the significance of incorporating NLP-based patient-specific information into prediction models, leading to a notable improvement in the discriminatory power and robustness of the ML algorithms. This is of major importance for rare diseases as early diagnosis can result in better management and progression of the disease [15]. To achieve this goal, the Imagine Institute has developed Dr. Warehouse, a biomedical data warehouse that integrates a Natural Language Processing (NLP) pipeline [16] that enables the extraction of clinical phenotypes from unstructured clinical notes from Necker-Enfants Malades Hospital. We showed that mining Dr. Warehouse accelerated the diagnosis of Dravet syndrome [8] and identified two undiagnosed cases harboring the same de novo variant in KCNA2 [17]. These two examples demonstrate the utility of this approach to accelerate the diagnosis of difficult cases. Now, our objective is to adapt it to complex and pleiotropic rare diseases such as ciliopathies.

Ciliopathies are an expanding group of more than fifty severe and rare genetic disorders associated with pathogenic variants that result in the abnormal formation or function of cilia. Renal manifestations are common features in ciliopathies [18]. Among renal ciliopathies, recessive nephronophthisis (NPH) is characterized by chronic tubulointerstitial nephritis and the development of massive renal fibrosis and cysts leading to end-stage kidney disease (ESKD), which typically occurs in childhood but also later in adults. NPH can be observed as an isolated condition or as part of multiorgan disorders such as Senior-Løken syndrome or Joubert syndrome. Biallelic variants of the NPHP1 gene are the main cause of pediatric NPH. Recent studies have demonstrated that renal ciliopathies are largely underdiagnosed, as high rates of NPHP1 pathogenic variants (mostly deletion) have been discovered in adults with chronic kidney disease [19] (CKD). Being able to diagnose ciliopathy patients as early as possible is crucial so that they can have access to appropriate support and personalized care and benefit from potential future treatments. Indeed, a potential treatment for NPHP1-associated renal ciliopathies has been recently investigated with promising results [20]. Such a treatment might be useful to prevent the onset of severe CKD for patients at an early stage but is also expected to slow down the evolution of CKD for patients with kidney lesions. Diagnosing patients regardless of the CKD severity is also important for patients’ families so that their relatives can be given molecular diagnosis at an early stage through cascade screening.

In a previous study [21], we used word embeddings and a similarity-based ranking model to identify from the EHRs the top-ranked patients who were most likely to have any ciliopathy disorder. We obtained good precision scores but a very low sensitivity. Based on these results and considering the recent promising therapeutic advances for NPHP1 conditions, we decided to design a machine-learning based algorithm focusing on NPHP1 cases.

Detecting rare patients within large hospital EHR repositories is challenging for at least three reasons: the high dimensionality of data, the important imbalance between cases (the minority class) and controls (the majority class), and the dependence between phenotypes.

To address the high dimensionality issue, some feature selection processes can be applied to raw data [22]. It can rely on expert knowledge, or it can be an automated task trained to select the most relevant and nonredundant features for a specific classification task [22, 23]. Regarding imbalance, data-based methods, such as undersampling and oversampling, are often considered to balance the two classes.

To consider the dependence between phenotypes, the semantic similarity between two phenotypes may be calculated based on their relative position in a hierarchy and the information content (IC) of their most informative common ancestor (MICA) [24, 25]. An alternative consists of using phenotype embeddings, which are vector representations that capture word context in a document and encode its meaning such that concepts with similar meanings are closer in the vector space. However, these two methods were usually implemented as an isolated approach and were rarely considered in a machine learning context.

Our objective is to develop a supervised machine learning pipeline based on automatic phenotyping from unstructured EHRs to identify undiagnosed patients bearing NPHP1 variants while addressing the three aforementioned challenges. The patients classified as potentially having NPHP1 ciliopathy could then undergo genetic testing for diagnosis confirmation and be eligible for potential future treatment. This study was performed in the context of the C’IL-LICO project, a research initiative coordinated by the Imagine Institute that aims to accelerate the diagnosis and treatment of ciliopathy patients.

Methods

Databases and data encoding

The Necker-Enfants Malades Hospital is a French reference center for rare and undiagnosed diseases that belongs to the Greater Paris Hospital (AP-HP) and hosts the Imagine Institute, a research center specializing in genetic diseases. Its clinical data warehouse, Dr. Warehouse [16] contains EHRs of more than 800,000 patients, including more than 9 million clinical documents, which constitute a major source of clinical signs and phenotypes [26]. The high-throughput phenotyping module of Dr. Warehouse enables the automatic extraction of all types of clinical entities from EHRs based on the UMLS Metathesaurus® [27]. Besides detection and normalization, the module performs disambiguation, negation and subject prediction [28], which are fundamental to determine whether a clinical sign is present or absent, experienced by the patient or by their relatives. In this analysis, we focused on present phenotypes experienced by the patient. As the hierarchy of the Human Phenotype Ontology (HPO), a standard vocabulary for rare disease phenotypes used in clinical practice and genetic research, looked more appropriate than the UMLS network for computing semantic similarity, we converted all extracted UMLS phenotypes to HPO concepts based on the mapping provided by the HPO consortium (HPO format-version: 1.2; data-version: releases/2020-12-07; downloaded on 2020-12-10).

As a reference for ciliopathy cases, we used a research database named Cilio-base, developed at the Imagine Institute that contains structured research information for more than 1800 patients clinically diagnosed with ciliopathies, including diagnoses and causal genes. 1103 of the patients have biallelic variants in one causative gene identified. 116 different causative genes were identified in these patients. In the present study, we focused on NPHP1 ciliopathy cases followed at Necker Hospital (i.e., who have EHRs in Dr. Warehouse). Of note, Cilio-base does not contain patients diagnosed with autosomal dominant polycystic kidney disease (ADPKD) or autosomal recessive polycystic kidney disease (ARPKD) because these ciliopathy subtypes are well described and easier to diagnose.

Patient selection and extraction of phenotypes

We included as cases all patients with ciliopathy caused by NPHP1 biallelic pathogenic variants who were followed at Necker Hospital.

The main symptom associated with NPHP1 variants is the degradation of kidney function, which leads to ESKD, and consequently dialysis and/or transplantation. As our objective was to diagnose patients as early as possible and because post-transplantation events can provide noisy phenotypes, we decided to use only the phenotypes that occurred before the transplantation. The transplantation date was manually checked for all ciliopathy cases.

To distinguish between patients with NPHP1 ciliopathy and patients with other kidney diseases, we defined control patients as patients who exhibited some overlapping phenotypes with NPHP1 ciliopathy patients, namely, kidney defects. We thus randomly selected from Dr. Warehouse 10,000 nonciliopathy patients (i.e., excluding patients who were present in Cilio-base) having at least one automatically extracted UMLS concept subsumed by Kidney diseases ([C0022658]) or by Congenital anomaly of the kidney ([C0266292]) in their EHRs. The number of controls was determined to emulate the critical imbalance that exists in real-life conditions. Afterwards, control patients who had undergone a kidney transplantation were automatically excluded to avoid phenotypes associated with post-transplantation events.

For cases and controls, phenotypes were extracted from clinical notes using the phenotyping module of Dr Warehouse and converted from UMLS to HPO. To ensure that sufficient phenotypic information was available, we limited ourselves to cases and controls with at least four distinct HPO concepts extracted from their EHRs (Fig. 1).

Fig. 1
figure 1

Flowchart of the patient selection process

Semantic similarity based on HPO

We first considered the method published by Lin [25] to compute the similarity between HPO concepts. We chose this classic similarity-based method after performing some preliminary tests that showed very few differences in performance among the different methods. This finding is similar to the one of Jia et al. who concluded in their study on patient similarity measurement that “IC-based code-level similarity algorithms are comparable to each other” [29, 30]. This method starts by determining the IC of each phenotype p in the HPO, defined as \(IC\left(p\right)= -log(\frac{n}{N})\), with \(n\) the number of descendants of \(p\) and \(N\) the maximum number of concepts in the HPO. The similarity between two phenotypes \({p}_{1}\) and \({p}_{2}\) is calculated based on the IC of their MICA in the HPO:

$$Sim_{Lin} (p_{1,} p_{2} ) = \frac{{2 \times IC\left( {MICA} \right)}}{{IC\left( {p_{1} } \right) + IC\left( {p_{2} } \right)}}$$
(1)

One concern with this method is that two terms can be close in a hierarchy but have completely different mechanisms and etiologies, such as CKD and acute kidney injury (AKI). Consequently, we proposed an adjustment of the Lin method where the similarity between two terms is set to 0 if there is no subsumption relationship between them:

$$Sim_{RestrLin} (p_{1} ,p_{2} ) = \left\{ {\begin{array}{*{20}l} {\frac{{2 \times IC\left( {MICA} \right)}}{{IC\left( {p_{1} } \right) + IC\left( {p_{2} } \right)}}} \hfill & { if\, p_{1} \,and\, p_{2} \,on\, the \,same\, branch \,of\, the\, hierarchy } \hfill \\ {0 } \hfill & { otherwize. } \hfill \\ \end{array} } \right.$$
(2)

The similarity metrics SimLin and SimRestrLin are referred to as Lin similarity and restricted hierarchical similarity, respectively.

Semantic similarity with term embeddings

We considered two embedding methods: (1) fastText embeddings trained locally on 2.5 million documents from Necker [28], and (2) CODER embeddings [31], which are medical term embeddings infused with relational knowledge from UMLS showing good performance in various NLP tasks, outperforming other state-of-the-art biomedical word embeddings. In both cases, the semantic similarity between two phenotypes \({p}_{1}\) and \({p}_{2}\) is calculated using cosine similarity between two embeddings denoted as

$$Sim_{cos} \left( {p_{1} , p_{2} } \right)$$
(3)

These methods are referred to as fastText embeddings and CODER embeddings, respectively.

Patient representation

Let \(P\) denote the set of distinct phenotypes in our dataset, and let \(D\) denote the set of phenotypes present for a patient. Each patient is represented as a vector \(V\) with length(\(V\)) = Card(P). In the baseline approach, \(V\) is filled with binary values corresponding to the presence or absence of the phenotype \({p}_{i}\) for the patient. In the semantic cases, the vector \(V\) is filled with the maximum similarity between the phenotype \({p}_{i}\) and the set of phenotypes D present for the patient:

$$V\left[ i \right] = Max \left\{ {d_{j} \in D} \right\} Sim\left( {p_{i} ,d_{j} } \right)$$

Preprocessing and classification

To address the high dimensionality of the data, automatic feature selection using a Random Forest Classifier was applied as a preprocessing step, to keep only a subset of relevant phenotypes.

Regarding the imbalance between classes, we used an undersampling technique, Cluster Centroids, which selects a representative subset of controls based on the k-means algorithm.

Four classification methods, namely, a linear model, ridge regression (RidgeReg), a non-linear model, support vector machine (SVM), and two ensemble models, random forests (RF) and Gradient boosting (XGBoost), were considered. Gradient boosting is a well-established variant of ensemble learning methods, i.e., methods that combine many models together to create a final model. XGBoost has outperformed numerous machine learning models in recent challenges [32, 33]. We trained our models using fivefold cross-validation on two-thirds of the data and tested them on the remaining one-third.

For each classification method combined with the preprocessing steps (feature selection + undersampling), the hyperparameter settings reaching the best average performance within the cross-validation on the training set were used to predict the patient class on the test set (not undersampled). The range of hyperparameters tested for each classifier is given as supplementary Methods (see Additional file 1). As the number of cases was low, the whole process was repeated 10 times to increase the reliability.

Evaluation

Regarding performance metrics, to be able to compare the performances of the different algorithms and to get consistent and interpretable results, we fixed the specificity to 90% by tuning the decision thresholds of the different algorithms and we focused on the sensitivity. In addition, the predictive performance was evaluated by assessing the number of cases within the set of patients identified by the classifier as at highest risk of having ciliopathy: all patients were ranked using their class probability, then sensitivity (also called recall) and precision (also called predictive positive value) among the top 1% and 10% highest ranked patients were provided, which were denoted as recall@k% and precision@k%, respectively. We also provided metrics of the global performances of the classifiers, such as the area under the receiver operating curve (AUROC) and under the precision-recall curve (AUPRC), which is generally more suited for imbalanced datasets. AUPRC must be interpreted in the context of imbalanced data, as the value is strongly impacted by data imbalance. The overall evaluation process is summarized in Fig. 2.

Fig. 2
figure 2

Flowchart of the evaluation process

Implementation

Analyses were performed with Python version 3.8 using scikit-learn, imbalanced-learn, UMAP, statistics, matplotlib packages, and R version 3.6.3 using tidyverse, OntologyIndex, OntologySimilarity packages.

Results

Characteristics of the patients

Fifty-seven ciliopathy patients with NPHP1 biallelic pathogenic variants were followed at Necker Hospital with sufficient phenotypic information. Twenty-seven cases were excluded because these patients were hospitalized at our center exclusively to receive a kidney transplant and consequently had no phenotypes in their EHR before. Among the 30 remaining cases, 17 patients underwent a kidney transplantation, for whom only the phenotypes collected before the procedure were kept. Regarding diagnosis, 18 patients were diagnosed with an isolated form of NPH, 8 patients had NPH and neurological symptoms (3 patients with Joubert syndrome, 5 with other brain defects), and 4 patients were diagnosed with Senior-Løken syndrome (renal and ocular defects). Regarding the CKD severity, at the date of the most recent captured phenotype, 18 (60%) patients had not reached ESKD yet (2 patients with mild kidney damage and 16 patients with moderate to severe kidney damage), and 12 patients (40%) had reached ESKD. As mentioned in the Introduction, diagnosing these ESKD patients is less crucial although still very important to avoid diagnosing wandering.

A total of 10,000 controls with kidney-related symptoms and sufficient phenotypic information were randomly selected, among which 2769 patients with kidney transplants were removed.

The final dataset consisted of 30 ciliopathy cases and 7,231 controls (Fig. 1). Patient characteristics are displayed in Table 1. Regarding extracted HPO concepts (Table 2), general renal defects (e.g., renal insufficiency, abnormality of the kidney) were strongly represented among both cases and controls. HPO concepts that can be indicative of ciliopathy related CKD were more frequent among cases (e.g., anemia, hyperparathyroidism, polydipsia, enuresis).

Table 1 Patient characteristics
Table 2 Top 20 HPO concepts among cases and controls

Performance of the DSS

A total of 7,261 patients (NPHP1 + controls) presented 1,863 distinct phenotypes (Fig. 2). In each train-test split, 20 cases and 4,844 controls were used to train the model. At each round of the cross-validation, between 250 and 600 phenotypes were automatically kept after the feature selection, depending on the similarity method used. Regardless of the method considered, the selected features included phenotypes known to be associated with ciliopathy, such as CKD, polydipsia, and molar tooth sign on the MRI, as well as phenotypes such as AKI, which allows distinguishing between ciliopathy patients and controls.

The best hyperparameters were determined via grid search and then applied to the test set. The performance metrics were computed in the test set, then averaged across the 10 repetitions of train-test split for each classifier combined with each method of semantic similarity (Table 3).

Table 3 Performance of each classifier combined with each semantic similarity method in the test set

Within our pipeline (filtering based on kidney transplantation, feature selection), the baseline method already achieved high performances with all classifiers reaching sensitivity scores around 80% for a 90% specificity. AUROC [0.91–0.95] and AUPRC [0.44–0.47] were high considering our imbalanced context, as the AUPRC is expected to be 0.0041 using a random model (based on the prevalence of the cases in the dataset). The baseline model also showed a strong ability in identifying patients who were most likely to have ciliopathy since the recall@1% was between [54–61%]. Using Lin similarity decreased the sensitivity compared to the baseline method. However, the restriction to ascendants-descendants successfully addressed this issue and provided higher performances as expected. The restricted hierarchical similarity combined with ensemble methods (RF and XGBoost) reached overall the highest AUROC and highest sensitivity, demonstrating that the restricted hierarchical similarity was able to prevent the misclassification of control patients presenting phenotypes that were siblings of ciliopathy-related phenotypes. This can be illustrated with CKD versus AKI: while Lin similarity considers these two phenotypes as highly similar, the restricted hierarchical similarity between these two phenotypes is null, leading to fewer false positives and better overall performance. Considering the top ranked patients, restricted hierarchical similarity with RF showed a strong capacity to identify ciliopathy patients, with a recall@1% of 59% (i.e., precision around 25% whereas using a random model it would be 0.41%) and a recall@10% of 85%. Comparable performances among top 1% and top 10% were obtained by CODER embeddings with ridge regression. Embedding methods reached overall high performances, especially with non-ensemble methods.

Qualitative analysis

To get qualitative insights regarding the performance, we performed a manual review of the EHRs of some patients misclassified by the restricted hierarchical model with RF. We focused on this method because it achieved strong overall performances (AUROC = 0.93, AUPRC = 0.43) and high ability to mitigate the number of false positives among the top ranked patients (sensitivity = 85%, recall@1% = 59%, recall@10% = 85%). Over the 10 runs, five ciliopathy patients were systematically classified as controls by this method (false negatives). These patients were also often classified as controls by other methods. Four of them (P1, P3, P4, P5) had isolated NPH, and one patient (P2) had renal and brain defects. Despite the restriction to patients with at least four distinct HPO concepts, the review of these five patient profiles showed that their EHRs contained poor phenotypic descriptions regarding ciliopathy:

  • Three patients (P1–P3) were followed up in other hospitals and had only one document in Dr. Warehouse.

  • One patient (P4) underwent a kidney transplantation, and his only visit to Necker before that was for heart surgery. Consequently, he had no ciliopathy-related phenotype recorded in his EHR before transplantation.

  • The last patient (P5) had only a few extracted ciliopathy-related symptoms.

Notably, NPHP1 molecular diagnosis in P3 and P5 was based on family cascade screening, but these patients exhibited no (P5) or very mild (P3) symptoms upon genetic testing. Moreover, four patients (P1–3 and P5) had symptoms compatible with NPHP1 ciliopathy related CKD such as “reins discrètement hyperéchogènes” (mildly hyperechogenic kidneys), “infléchissement staturo-pondéral” (growth faltering) or “syndrome polyuro-polydipsique” (polyuro-polydipsia syndrome), that the UMLS-based extraction module of Dr. Warehouse failed to extract, because the expression used in the documents did not exist in the UMLS.

Interestingly, when performing uniform manifold approximation and projection (UMAP) using restricted hierarchical similarity (Fig. 3), four patients (P2–5) were grouped, whereas P1 was isolated in a cluster of controls containing many false positives. Presence of P1 in the training set may explain the number of false positives with common renal defects, which highlights the role of data quality especially in case of “small data”.

Fig. 3
figure 3

UMAP with restricted hierarchical similarity. TP true positives; FN false negatives, i.e., cases always misclassified by the method, corresponding to patients P1–5

Regarding false positives, a ciliopathy expert (SS) reviewed the EHRs of 40 control patients who were classified as having ciliopathies when using the default decision threshold and were grouped with true positives with UMAP. Among them, ten patients (25%) had phenotypes belonging to the ciliopathy spectrum:

  • Five patients were diagnosed with ADPKD (3 confirmed and 2 strongly suspected), and one patient was diagnosed with ARPKD. As explained in section “Databases and data encoding”, these patients were not included in Cilio-base. Consequently, some of them were present in our control dataset.

  • One patient had been suspected by clinicians to have NPH but then confirmed otherwise.

  • Three patients were suspected to have a pathogenic variant in the HNF1B gene, which is known to potentially impact ciliary function and to lead to overlapping phenotypes with ciliopathy related CKD.

Thus, our classification model classified as ciliopathy some controls who had phenotypes belonging to the ciliopathy spectrum, which is a good indicator that the pipeline can detect patients that could be suspected to have ciliopathy.

Other recurrent diagnoses included glomerulopathies (7 patients), infection-induced kidney disease (3 patients), AKI (3 patients), and chronic tubulointerstitial nephritis due to lithium administration (2 patients), which is a differential diagnosis for NPH.

Discussion

Main results

With the objective of distinguishing between NPHP1-mutated patients and other nephropathies, we designed a pipeline that considers the problem as a data-driven classification task. The supervised machine learning pipeline combines a phenotyping module re-using unstructured EHR data, semantic similarity measures to integrate dependence among phenotypes, a feature selection step to deal with the high dimensionality, an undersampling step to address the data imbalance, and a classification step with multiple train-test split for the small number of NPH cases.

The performances were generally high, especially when using the baseline model, the restricted hierarchical similarity or CODER embeddings. This is an encouraging result, given that 60% of NPHP1 patients have isolated NPH. Of note, the linear models achieved good results, and the general performances were not systematically improved using more complex models, as already observed by other authors [26, 34, 35]. Although the best AUPRC were obtained with the baseline model, the restricted hierarchical similarity reached high overall performances and the best sensitivity for a 90% specificity. Such high true positive rates are essential to ensure that the number of patients classified as ciliopathy, i.e., who would be referred to a geneticist, can be managed in real life.

Regarding other attempts to detect rare diseases patients among large-scale clinical databases, very few systems have been developed yet relying on machine learning and phenotypes. We identified seven articles corresponding to these criteria [34,35,36,37,38,39,40], all published between 2020 and 2022. When provided, the specificity reached by these systems was high [85–99%], whereas the sensitivity was generally low, ranging from 14 to 64% when data imbalance was considered.

Technical significance

Our approach differs from the disease recommendation systems usually used in the rare disease domain [4]. Indeed, disease recommendation systems can also rely on disease similarity [41, 42]. They generally work as follows: each rare disease is described by a set of phenotype concepts usually extracted from knowledge bases and the system returns a list of diseases ranked by the score of similarity between the patient and a rare disease. However, their objective is to provide a diagnosis for any undiagnosed rare disease patient. In this study, we investigated a different approach, which is to integrate phenotypic similarity into the patient vectorial representation and then classify patients using a machine learning predictive model. Such a data-driven approach aims at identifying patients with NPHP1 ciliopathy from large-scale clinical databases containing both rare and common diseases.

Our pipeline reached very good performance despite a very low number of NPHP1-mutated patients. Contrasting with the poor results obtained without undersampling, very high specificity was reached with ClusterCentroids undersampling, although only 0.41% of controls were used to train the model, showing that ClusterCentroids is a good strategy to minimize underfitting.

Contrasting with other approaches [35, 38,39,40], in our work, we minimized the integration of a priori expert knowledge. As knowledge regarding ciliopathies is still evolving, and as our objective is to design a pipeline that can be easily adapted to other rare diseases suffering from underdiagnosis, we decided to focus on a data-driven approach.

Regarding the similarity calculation, we think that the similarity between two phenotypes should integrate some physio-pathological features. Grani et al. [43], for example, stated that disease ontologies and taxonomies should integrate more etiology and genetic causes of diseases. The mitigated results obtained by the classic Lin similarity show the need to apply similarity methods on appropriate ontologies or to use methods where the proximity between concepts would reflect common mechanisms. Here, we demonstrated that similarity restricted to ascendant-descendant relations was more relevant than the basic Lin approach. Chen et al. reached a similar conclusion for developing a disease-ranking model for rare disease diagnoses [44]. However, we think further work is needed to improve the modeling of the dependence and relatedness between phenotypes. Methods based on node embeddings for example [45] could be interesting to test.

While most models trained to predict diagnosis have used only structured data until now [26, 34], it has been showed that text reports provide much more phenotypic information [15, 46, 47]. Of note, we did not process raw data such as images and measurements. However, we took all medical notes from the patients’ EHRs, which include all types of reports, such as radiology or laboratory test reports. All abnormal results are usually transcribed as phenotypes in the text of these reports (e.g., “renal cyst” or “small kidney” in the MRI report) and interpreted by the clinicians in their notes (e.g., translated into “proteinuria” or “anemia”). However, it is difficult to obtain complete patient representations due to poor-quality data, incompleteness or heterogeneity [47, 48]. The manual review of the false negatives showed that patients with very few visits—or even not followed at Necker Hospital—had phenotypic descriptions that were so poor that it was impossible to automatically categorize them as NPHP1-case. Moreover, some of them had ciliopathy-related phenotypes that were not captured by our thesaurus-based extraction pipelines. This issue is of major importance for rare diseases when the number of cases is very small and high-quality data is needed to train the model. Named Entity Recognition approaches such as PhenoBERT [49] or hybrid methods to link textual mentions to concepts in an ontology [50] obtained encouraging preliminary results. In another article [51], we showed how we strongly improved the automatic extraction of phenotypes from EHRs for a ciliopathy named Jeune syndrome using a hybrid method combining a dictionary-based method with deep learning. We plan to adapt this method [51] for NPHP1 patients and evaluate whether it can further improve patient classification. We expect to obtain an even better performance of our diagnosis algorithm with a more comprehensive set of automatically extracted phenotypes.

Clinical significance

According to König et al. [52], 63% of patients with NPHP1 variants reach ESKD at a mean age of 11.4. Recently, a group of researchers from the Imagine Institute identified a potential first-class treatment for NPHP1-associated renal ciliopathies [3]. Patients at risk of carrying this variant must be diagnosed without delay to have access to appropriate support and personalized care and to benefit from potential future treatments. Moreover, a recent study published by Snoek et al. [19] on more than 5000 ESKD patients showed that NPHP1 was a relatively frequent monogenic cause of adult-onset ESKD but that most patients with NPHP1 variants were given a diagnosis of “chronic kidney disease with unknown etiology” [19]. We trained and validated our system on a population from a children’s hospital, where the mean age was around 17 years old for ciliopathy patients and controls, and we plan to extend its coverage to adult nephropathy patients soon.

Another interesting point raised in this study is the high percentage (25%) of patients with phenotypes belonging to the ciliopathy spectrum in the set of patients considered “false positives”. This result demonstrates the ability of the system to find highly similar patients and the interest in applying this approach to domains where the distinction between cases and controls is still rather fuzzy.

Perspectives

We plan to conduct further investigations and tests on data from non-specialized hospitals (children and adults), where the probability to find undiagnosed NPHP1 patients is much higher than in reference centers for rare and undiagnosed diseases like Necker Hospital.

As previously stated, in our work, we minimized the integration of a priori expert knowledge; our approach is fully automated and relies on phenotypes and not on specific disease-related material and can be easily adapted to other rare diseases suffering from underdiagnosis. One of our perspectives is now to train and evaluate this current system on other ciliopathy patients. The next step will be to evaluate the impact of enriching our pipeline with extra information such as demographics, negative phenotypes [53, 54], and lab test results on the classification performance. We will evaluate it within a two-step pipeline, where such kind of information is used as a second step to improve the differential diagnosis process.

Conclusion

We developed a supervised machine learning pipeline based on deep phenotyping of EHRs that can help improve the detection of patients with rare and complex diseases such as ciliopathies. Relying on semantic similarity improved the performance, especially regarding the sensitivity. We think the lessons learned from this analysis can be of interest for other research teams working on rare disease diagnosis. The next steps will consist of testing our method on adult datasets and other hospitals and extending our data-driven approach further to other rare diseases.