Abstract
Studying Mendelian diseases and identifying their causative genes are of great significance in genetics. Two questions are particularly important: how pathogenic a given gene is, and how many Mendelian disease genes exist in total. Very few studies have addressed either question to date, so we attempt to answer both here. We used a machine learning approach (the random forest algorithm) to calculate a gene pathogenicity prediction (GPP) score that evaluates the pathogenicity of genes. Applied to the testing gene set, the GPP score achieved an accuracy of 80%, a recall of 93% and an area under the curve of 0.87. We estimated that a total of 10,384 protein-coding genes are Mendelian disease genes. Furthermore, we found that the GPP score was positively correlated with the severity of disease. Our results indicate that the GPP score may provide a robust and reliable guideline for predicting the pathogenicity of protein-coding genes. To our knowledge, this is the first attempt to estimate the total number of Mendelian disease genes.
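As a reading aid, the three evaluation metrics quoted above (accuracy, recall and area under the ROC curve) can be computed from per-gene scores and known labels as in the minimal sketch below. This is not the authors' code: the function names, the toy scores and labels, and the 0.5 cutoff applied to the score are all assumptions made for illustration, and the sketch does not train a random forest.

```python
from typing import List, Tuple


def accuracy_and_recall(scores: List[float], labels: List[int],
                        cutoff: float = 0.5) -> Tuple[float, float]:
    """Classify a gene as a disease gene when score >= cutoff (assumed cutoff)."""
    predicted = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    accuracy = (tp + tn) / len(labels)
    # Recall: fraction of true disease genes that were recovered.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = known disease gene, 0 = non-disease gene.
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

acc, rec = accuracy_and_recall(toy_scores, toy_labels)
auc = roc_auc(toy_scores, toy_labels)
print(f"accuracy={acc:.3f} recall={rec:.3f} auc={auc:.3f}")
```

Accuracy and recall depend on the chosen cutoff, whereas the AUC summarizes ranking quality across all cutoffs, which is why the three figures are reported together.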
Acknowledgements
We thank the providers and maintainers of the public databases used in this study. We appreciate the support of the Shenzhen Key Laboratory of Neurogenomics (CXB201108250094A). This study was supported by the National Natural Science Foundation of China (81771444) and the Shenzhen Municipal Government of China (No. JCYJ20170412153248372).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Electronic supplementary material
Below is the link to the electronic supplementary material.
439_2019_2021_MOESM2_ESM.pdf
Supplementary material 2 (PDF 312 kb) Supplementary Fig. 1 GO analysis of the predicted pathogenic and non-pathogenic genes of Mendelian diseases. GO pathway analysis for the reported_M, predicted_M and predicted_NM sets. The X-axis represents the different GO pathways, while the Y-axis represents the percentage of genes in the selected pathway. The three gene sets are compared for each pathway. In general, the reported_M and predicted_M sets show little difference from each other, while the predicted_NM set shows a larger difference from both. Some opposite results are also observed.
439_2019_2021_MOESM3_ESM.png
Supplementary material 3 (PNG 233 kb) Supplementary Fig. 2 Distribution of predicted and reported Mendelian disease genes on each chromosome. The outer circle (black) indicates all protein-coding genes on each chromosome. The middle circle (red) indicates the reported MD genes. The inner circle (blue) indicates the predicted MD genes.
439_2019_2021_MOESM4_ESM.pdf
Supplementary material 4 (PDF 126 kb) Supplementary Fig. 3 ROC curves of the GDP and GRP scores and the distribution of these scores for Mendelian disease genes. A. ROC curve of the GDP score. The AUC is 0.81. B. The distribution of the GDP score of MD genes indicates that fewer genes are dominant (the cutoff is 0.5). C. ROC curve of the GRP score. The AUC is 0.81. D. The distribution of the GRP score of MD genes indicates that more genes are recessive (the cutoff is 0.5).
About this article
Cite this article
He, S., Chen, W., Liu, H. et al. Gene pathogenicity prediction of Mendelian diseases via the random forest algorithm. Hum Genet 138, 673–679 (2019). https://doi.org/10.1007/s00439-019-02021-9
DOI: https://doi.org/10.1007/s00439-019-02021-9