Advances in computer-assisted syndrome recognition by the example of inborn errors of metabolism
Significant improvements in automated image analysis have been achieved in recent years and tools are now increasingly being used in computer-assisted syndromology. However, the ability to recognize a syndromic facial gestalt might depend on the syndrome and may also be confounded by severity of phenotype, size of available training sets, ethnicity, age, and sex. Therefore, benchmarking and comparing the performance of deep-learned classification processes is inherently difficult. For a systematic analysis of these influencing factors we chose the lysosomal storage diseases mucolipidosis as well as mucopolysaccharidosis type I and II that are known for their wide and overlapping phenotypic spectra. For a dysmorphic comparison we used Smith-Lemli-Opitz syndrome as another inborn error of metabolism and Nicolaides-Baraitser syndrome as another disorder that is also characterized by coarse facies. A classifier that was trained on these five cohorts, comprising 289 patients in total, achieved a mean accuracy of 62%. We also developed a simulation framework to analyze the effect of potential confounders, such as cohort size, age, sex, or ethnic background on the distinguishability of phenotypes. We found that the true positive rate increases for all analyzed disorders for growing cohorts (n = [10...40]) while ethnicity and sex have no significant influence. The dynamics of the accuracies strongly suggest that the maximum distinguishability is a phenotype-specific value, which has not been reached yet for any of the studied disorders. This should also be a motivation to further intensify data sharing efforts, as computer-assisted syndrome classification can still be improved by enlarging the available training sets.
Deep phenotyping for deep learning
Enzyme replacement therapy
Facial dysmorphology novel analysis
False negative rate
False positive rate
Human phenotype ontology
Lysosomal storage disease
Mucopolysaccharidosis type I (MPS I)
Mucopolysaccharidosis type II (MPS II)
Receiver operating characteristics
True positive rate
In syndromology the information content of the facial gestalt is so extraordinarily high that photographs are important in the diagnostic work-up. This also holds true for many inborn errors of metabolism that result in dysmorphic facial features (see also a corresponding list from IEMbase© in the Supplemental material). Recently, advances in computer vision improved pattern recognition on ordinary facial photos of syndromic patients (Boehringer et al 2006; Ferry et al 2014; Gurovich et al 2018). These approaches also have the potential to quantify the similarities of patients to any specific syndrome for which a model exists and to decide whether there is a significant difference between gene-phenotypes (Knaus et al 2018; Gurovich et al 2018).
Face2Gene (FDNA Inc., Boston MA, USA) is such a novel tool that supports pattern recognition in frontal photographs (https://face2gene.com). The facial analysis within Face2Gene is a deep convolutional neural network (DCNN) that is referred to as DeepGestalt. Currently, this DCNN is able to compare a photo to about 300 different syndromic phenotype models and to compute a similarity value (Gurovich et al 2018). The CLINIC application of Face2Gene provides a list of 30 differential diagnoses that are based on these gestalt scores.
While Face2Gene CLINIC makes the latest classification models available that were trained on the entire set of suitable cases the user community provided, a recently launched application, referred to as RESEARCH, allows working with DeepGestalt in a controllable environment (Knaus et al 2018). This app can be used to learn the facial gestalts of different cohorts that share, for example, disease-causing mutations in the same gene or pathway. The results of an experiment are gestalt models suitable for binary and multi-class comparisons. The true positive rates (TPRs) as well as the error rates of the multi-class problem are reported in a confusion matrix, whereas the pairwise comparisons of cohorts are evaluated as receiver operating characteristic (ROC) curves.
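How per-cohort TPRs and error rates are read off a multi-class confusion matrix can be illustrated with a minimal sketch. The matrix, the cohort labels, and the function name `per_class_rates` are hypothetical and chosen only for illustration; they are not output of Face2Gene RESEARCH.

```python
# Hypothetical 3-class confusion matrix; rows are true cohorts, columns are
# predicted cohorts. The counts are made up for illustration only.
labels = ["A", "B", "C"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [1, 1, 8],
]

def per_class_rates(matrix):
    """Return (TPR, FNR) for every row of a square confusion matrix."""
    rates = []
    for i, row in enumerate(matrix):
        total = sum(row)
        tpr = row[i] / total if total else 0.0   # correctly assigned share
        fnr = 1.0 - tpr if total else 0.0        # share assigned elsewhere
        rates.append((tpr, fnr))
    return rates

rates = per_class_rates(confusion)
mean_tpr = sum(tpr for tpr, _ in rates) / len(rates)

for label, (tpr, fnr) in zip(labels, rates):
    print(f"{label}: TPR={tpr:.2f}  FNR={fnr:.2f}")
print(f"mean TPR={mean_tpr:.2f}")
```

With evenly sized cohorts, the mean of the per-class TPRs equals the overall accuracy, which is why the two quantities can be compared directly in such a setting.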
If the gestalt models achieve accuracies in the classification of photographs higher than randomly expected, there are recognizable facial patterns in individuals of a cohort. When phenotypes of the same molecular subgroup are compared, a significant distinguishability also means that a clinical entity can be delineated based on the facial gestalt. While this delineation of syndromic phenotypes has been reserved to a few experts in the field, computer-assisted pattern recognition might help to objectify, even quantify this process. However, if we interpret the accuracy of a classifier as the quantification of the distinguishability of disease-phenotypes, it is of utmost importance that the factor we are measuring is not confounded by, e.g., age, ethnicity or sex. In this work, we therefore present a framework for a systematic analysis of potential confounders that we tested on patients with inborn errors of metabolism (IEMs).
Patients and methods
We focused our analysis on IEMs and phenotypically similar disorders, 1) that have a high prevalence, 2) that are already represented in Face2Gene CLINIC, and 3) that are straightforward to confirm in the lab (Baehner et al 2005). We compiled an original sample set of 289 typical and atypical patients with mucopolysaccharidosis (MPS I and II), mucolipidosis (ML II alpha/beta and ML III alpha/beta), Smith-Lemli-Opitz syndrome (SLOS), and Nicolaides-Baraitser syndrome (NCBRS) that have all been molecularly confirmed (see Supplemental material for literature references). The facial gestalts of some patients are so similar, even for experts, that it is hard to tell the diseases apart without enzymatic or genetic testing. Due to this phenotypic overlap, our data set is also a challenging task for computer vision. In addition, especially within the IEMs, there is considerable phenotypic variability. For the lysosomal storage disorders (LSDs) MPS and ML, hardly any symptoms are present at birth, but they usually appear during early childhood and progress during adolescence (Muenzer 2011). The extent of the enzyme deficiency influences the severity of the phenotype and in, e.g., MPS I, the genotype-phenotype correlations are also reflected by the clinical subdivision into Hurler, Hurler-Scheie, and Scheie syndrome (Bunge et al 1998). Although there is no cure for MPS, hematopoietic stem cell transplants or enzyme replacement therapies (ERT) have shown considerable treatment success that could also slow down the progression of symptoms (Kung et al 2013, 2015; Watson et al 2014; Bradley et al 2017; Kubaski et al 2017; Rodgers et al 2017). This also means that treatment duration, in addition to age, might affect the severity of the phenotype in this disease cohort.
The phenotypic comparison of the cohorts was based on the clinical or molecular diagnosis and all experiments were run in Face2Gene’s RESEARCH application (version 17.6.2), which is accessible to registered users.
With our original sample set of 289 labeled photos, we were able to study the potential confounders cohort size, ethnic background, and sex. For an analysis of the intertwined factors age and treatment duration, we did not have sufficient individuals to form the required subsets. The performance on subsets was evaluated after random down-sampling to the same size. Based on the Python requests library v2.18.4, we built a framework to automate the repetition of experiments, and the TPRs of the resulting confusion matrices were averaged over five iterations for each setting. The scripts for the simulations are available on request and can be used to reproduce the results.
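The repetition scheme described above (random down-sampling to a common size, followed by averaging the TPRs over five iterations) can be sketched as follows. This is not the authors' script: `run_experiment` is a hypothetical stand-in that returns random TPRs so the loop is runnable, whereas in the study this step is carried out by Face2Gene RESEARCH; the cohort names and image identifiers are likewise made up.

```python
import random
from statistics import mean

def downsample(cohorts, size, rng):
    """Randomly draw `size` cases from every cohort (without replacement)."""
    return {name: rng.sample(cases, size) for name, cases in cohorts.items()}

def run_experiment(subsets, rng):
    """Hypothetical placeholder for one classification experiment.

    It ignores the images and returns a random TPR per cohort; a real run
    would train and evaluate a classifier on `subsets` instead.
    """
    return {name: rng.uniform(0.4, 0.8) for name in subsets}

def averaged_tprs(cohorts, size, iterations=5, seed=0):
    """Repeat down-sampling plus experiment and average the TPR per cohort."""
    rng = random.Random(seed)
    runs = [run_experiment(downsample(cohorts, size, rng), rng)
            for _ in range(iterations)]
    return {name: mean(run[name] for run in runs) for name in cohorts}

# Made-up cohorts of image identifiers with unequal sizes.
cohorts = {
    "cohort_a": [f"a{i}" for i in range(45)],
    "cohort_b": [f"b{i}" for i in range(30)],
    "cohort_c": [f"c{i}" for i in range(60)],
}

# The smallest cohort determines the common size for all cohorts.
common_size = min(len(cases) for cases in cohorts.values())
result = averaged_tprs(cohorts, common_size)
print(common_size, result)
```

Drawing a fresh random subset in every iteration means the averaged TPR reflects both the variability of the classifier and the variability of the sampled cases.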
The influence of the cohort size was analyzed by incrementing evenly sized subsets from 10 to 40. The change in performance was fitted to a linear model and analyzed for significance using linregress of the SciPy library. The other potential confounders, ethnic background and sex, were analyzed by excluding cohort size as a covariate. For these experiments, we sampled each cohort down to the greatest common size for each potential confounder. The greatest common size for the potential confounder male sex would, e.g., be 20, because there are only 20 male patients with MPS I in our original sample set (see Fig. 1). In this way, cohort size has no influence on the performance, which allowed an analysis of the potential confounders sex and ethnicity. Matthews correlation coefficient (MCC) is a measure of the quality of a two-class classification. Therefore, we reduced the multiclass confusion matrix to a two-class matrix for every diagnosis. Then we calculated the mean MCC for all iterations of the same experiment. If the difference of the MCCs of the potential confounder and control experiments was within the range of two standard deviations of the MCCs of the control experiments, we regarded the variable as not having a significant effect on the analyzed disease.
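The linear fit of performance against cohort size can be illustrated with a short sketch that uses `scipy.stats.linregress`. The TPR values, the step size of 5 between cohort sizes, and the significance threshold of 0.05 are assumptions made for this example; the text above does not state them.

```python
from scipy.stats import linregress

# Made-up mean TPRs of one disorder at increasing cohort sizes.
cohort_sizes = [10, 15, 20, 25, 30, 35, 40]
mean_tprs = [0.40, 0.45, 0.47, 0.52, 0.55, 0.58, 0.62]

# Ordinary least-squares fit; pvalue tests the null hypothesis slope == 0.
fit = linregress(cohort_sizes, mean_tprs)

alpha = 0.05  # assumed significance threshold
print(f"slope={fit.slope:.4f} per added patient, p={fit.pvalue:.3g}")
if fit.slope > 0 and fit.pvalue < alpha:
    print("TPR increases significantly with cohort size")
else:
    print("no significant trend")
```

A positive slope with a small p-value indicates that the TPR is still rising within the tested range, which is the pattern one would expect if the maximum distinguishability has not yet been reached.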
Classification of the original sample set in Face2Gene CLINIC and RESEARCH
Influence of growing cohort size on classification accuracy
Effect of ethnic background or sex on performance
We hypothesized that a bias in the setup of the cohorts with respect to ethnic background or sex might affect the performance. In general, the performance should drop if a true confounder is removed. If the performance increases instead after splitting up cohorts, this is an indicator that there is some characteristic feature that can be learned more efficiently in a more homogeneous group of patients.
Figure caption: The classification of Down syndrome (DS) is more accurate on only European or African patients. These marked differences cannot be observed for ML, MPS I, MPS II, SLOS, and NCBRS. Also, the restriction to only male patients has only a minor effect on the performance. The difference of MCCs for the binary classification of every disease was normalized by the standard deviations of MCCs that were computed in the mixed controls. Comparisons shown: CEU vs mixed, AFR vs mixed, male vs mixed.
In contrast to DS, we did not observe such marked differences in the MCCs for MPS I, MPS II, ML, SLOS, and NCBRS when running the experiments for cohorts of n = 22 that consisted only of European patients. An analysis for another ethnic background in these disorders was not possible due to a lack of sufficient patients.
Another potential confounder in the five-class problem of MPS I, MPS II, ML, SLOS, and NCBRS that we analyzed is sex. All but two of the MPS II patients were male, whereas the sex ratios for the other disorders were close to 1. This means that knowing the sex would help with distinguishing MPS II from MPS I cases. Interestingly, however, the MCC for the MPS II classification did not decrease when all other cohorts were also restricted to male patients only and to the same cohort size of n = 20. This indicates that a bias in the sex ratios does not substantially affect the performance of the classification process for the tested syndromes.
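The comparison behind these results, reducing the multiclass confusion matrix to a two-class matrix per diagnosis, computing the MCC, and checking whether the difference lies within two standard deviations of the controls, can be sketched as follows. The confusion matrices are made up, the function names are hypothetical, and the use of the sample standard deviation is an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, stdev

def one_vs_rest(matrix, k):
    """Collapse a multiclass confusion matrix to (TP, FN, FP, TN) for class k."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                      # class k predicted as other
    fp = sum(row[k] for row in matrix) - tp       # other predicted as class k
    tn = sum(map(sum, matrix)) - tp - fn - fp
    return tp, fn, fp, tn

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; 0.0 if the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mcc_values(matrices, k):
    """MCC of class k for every iteration of one experiment."""
    return [mcc(*one_vs_rest(m, k)) for m in matrices]

def within_two_sd(test_matrices, control_matrices, k):
    """True if the mean MCCs differ by at most two control standard deviations."""
    control = mcc_values(control_matrices, k)
    diff = abs(mean(mcc_values(test_matrices, k)) - mean(control))
    return diff <= 2 * stdev(control)  # sample SD is an assumption here

# Made-up 3-class confusion matrices from three iterations each.
control = [
    [[14, 3, 3], [4, 12, 4], [3, 3, 14]],
    [[15, 3, 2], [3, 13, 4], [2, 4, 14]],
    [[13, 4, 3], [4, 13, 3], [3, 2, 15]],
]
restricted = [
    [[14, 4, 2], [3, 13, 4], [3, 3, 14]],
    [[15, 2, 3], [4, 12, 4], [2, 3, 15]],
    [[14, 3, 3], [3, 14, 3], [4, 2, 14]],
]
no_effect = within_two_sd(restricted, control, k=0)
print("no significant effect" if no_effect else "possible confounder")
```

Because the MCC takes all four cells of the two-class matrix into account, it remains informative when one diagnosis is compared against the much larger pool of all remaining diagnoses.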
The TPRs that were achieved for all disorders in the five-class problems were higher than expected by random chance. Thus, our results show that the FDNA technology is capable of delineating gestalt differences even for clinically similar phenotypes. This finding is especially remarkable for the phenotypes of MPS and ML and is also supported by high AUROC values in binary classifications (Suppl. Fig. 1).
The differences in TPRs between the syndromes could be interpreted as different recognizabilities. Notably, SLOS and NCBRS are more recognizable than MPS I, MPS II, and ML. This corresponds to the results from the CLINIC app, where ML and MPS show lower distinguishability. These findings are in agreement with the opinion of expert geneticists, who regard ML as highly similar to MPS.
The high TPRs found in our analyses correspond to the results of two other studies on phenotypes of molecular pathway disorders. For Noonan syndrome as well as for GPI-anchor deficiencies, significant phenotypic substructures could be detected. This also illustrates that an even more fine-grained phenotype modeling might be possible with the CLINIC app in the future.
Distinguishing MPS I from the other disorders was slightly more effective when working in a European background. A possible explanation for this slight increase in performance might be that there are certain features that are restricted to or more prominent in European patients and that might therefore be learned more effectively if relatively more cases are used for training the model. This issue has already been discussed for other disorders, such as Fragile-X syndrome and Down syndrome, where ethnicity-specific differences in the feature presentation are known (Schwartz et al 1988; Lumaka et al 2017). Although we could replicate these effects for DS, we did not see a prominent change in the performance in the other phenotypes, which indicates that ethnic background is not a strong confounder in the classification process.
The human face shows a sexual dimorphism, possibly even at an early age, making sex a potential confounder in any facial image analysis process (Zhang et al 2016). The classification accuracies in our experiments that were based on data sets restricted to individuals of the same sex did not differ significantly, suggesting that the classification method is robust to sex as a confounder. Also, the mean MCCs showed no significant change when training the classifier on only male individuals as compared to a training cohort consisting of both sexes. Our interpretation is that sex does not confound the classification of MPS I, MPS II, ML, SLOS, and NCBRS.
We are just beginning to understand the potential of computer-assisted image analysis in the field of syndromology. In this work we have presented a general approach to study the distinguishability of a phenotype and to test the confounding effect of variables such as ethnicity or sex. We have applied this framework to a selection of inborn errors of metabolism; in principle, however, it is applicable to any other disorder.
It would also be interesting to compare the performance of the FDNA technology to the accuracies of other, previously published approaches to automated image analysis of syndromic patients. Comparative evaluation, however, is impeded by the lack of a publicly available data set for benchmarking. Earlier benchmarking approaches merely relied on the comparison to human classification performance. To achieve an objective evaluation of computer vision, we strongly advocate building a resource for image data of molecularly confirmed syndromic cases.
Conclusion and outlook
In this work we report on a next-generation phenotyping technology that can be used to study the similarities and differences between patients with rare genetic disorders. The framework that we present is suited not only to measure the accuracies of the DCNN in the classification process but also to test for confounding effects. Especially with respect to the novel and powerful methods in artificial intelligence, it is crucial to learn more about what is actually quantified by a DCNN. Our results show that DeepGestalt, the next-generation phenotyping technology within Face2Gene, is not confounded by sex or ethnic background for the studied phenotypes. The high predictive value for IEMs in the CLINIC application also makes Face2Gene a valuable tool to detect these kinds of disorders. This is especially important for patients who might have evaded early detection by newborn screening. The importance of such programs is, however, undiminished, as the outcome improves the earlier ERT can be started and the evolving phenotype of IEMs might be more difficult to detect in newborns than in older age groups.
Apart from detection, an even more important role of computer vision could be disease monitoring if a neural network is not only able to sense the presence of a disease but also to quantify features that, e.g., mirror the progress of GAG deposition. We hope to be able to investigate this question in future research when more data becomes available.
J.T.P. and M.A.M. were supported by the BIH Clinician Scientist program. P.M.K. received funding from the German Research Foundation (KR 3985/7-3).
Compliance with ethical standards
This article does not contain any studies with human or animal subjects performed by any of the authors.
Conflict of interest
J.T. Pantel, M. Zhao, M.A. Mensah, N. Hajjir, T.-C. Hsieh, Y. Hanani, N. Fleischer, T. Kamphans, S. Mundlos, Y. Gurovich and P.M. Krawitz declare that they have no conflict of interest.
- Bradley LA, Haddow HRM, Palomaki GE (2017) Treatment of mucopolysaccharidosis type II (Hunter syndrome): results from a systematic evidence review. Genet Med. https://doi.org/10.1038/gim.2017.30
- Ferry Q, Steinberg J, Webber C et al (2014) Diagnostically relevant facial gestalt information from ordinary photos. Elife 3:e02020
- Gurovich Y, Hanani Y, Bar O et al (2018) DeepGestalt - identifying rare genetic syndromes using deep learning. arXiv:1801.07637
- Kubaski F, Yabe H, Suzuki Y et al (2017) Hematopoietic stem cell transplantation for patients with mucopolysaccharidosis II. Biol Blood Marrow Transpl. https://doi.org/10.1016/j.bbmt.2017.06.020
- Kung S, Walters M, Claes P et al (2013) A dysmorphometric analysis to investigate facial phenotypic signatures as a foundation for non-invasive monitoring of lysosomal storage disorders. JIMD Rep 8:31–39. https://doi.org/10.1007/8904_2012_152
- Kung S, Walters M, Claes P et al (2015) Monitoring of therapy for mucopolysaccharidosis type I using dysmorphometric facial phenotypic signatures. JIMD Rep 22:99–106. https://doi.org/10.1007/8904_2015_417
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.