Gene pathogenicity prediction of Mendelian diseases via the random forest algorithm

  • Original Investigation
  • Human Genetics

Abstract

Identifying the causative genes of Mendelian diseases is of great significance in genetics. Two questions in particular deserve study: how to evaluate the pathogenicity of a gene, and how many Mendelian disease genes exist in total. Few studies have addressed either question to date, so we attempt to answer both here. We used a machine learning approach (the random forest algorithm) to calculate a gene pathogenicity prediction (GPP) score that evaluates the pathogenicity of genes. Applied to the testing gene set, the GPP score achieved an accuracy of 80%, a recall of 93% and an area under the curve (AUC) of 0.87. We estimated that a total of 10,384 protein-coding genes are Mendelian disease genes. Furthermore, the GPP score was positively correlated with disease severity. These results indicate that the GPP score may provide a robust and reliable guideline for predicting the pathogenicity of protein-coding genes. To our knowledge, this is the first attempt to estimate the total number of Mendelian disease genes.
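
The three figures reported for the testing gene set (accuracy, recall and area under the curve) can be illustrated with a minimal sketch. It is not the authors' code: the function name `evaluate_scores`, the toy scores and labels, and the 0.5 cutoff are all assumptions made for illustration, and the area under the curve is assumed to be the ROC AUC, computed here through the pairwise-ranking identity rather than by tracing the curve.

```python
def evaluate_scores(scores, labels, cutoff=0.5):
    """Return (accuracy, recall, auc) for per-gene scores.

    scores: predicted pathogenicity scores in [0, 1] (hypothetical values)
    labels: 1 for a known disease gene, 0 otherwise
    cutoff: score at or above which a gene is called pathogenic (assumed)
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")

    preds = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)

    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    accuracy = (tp + tn) / len(labels)
    recall = tp / (tp + fn)

    # ROC AUC equals the probability that a randomly chosen positive
    # outscores a randomly chosen negative; ties count as one half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    auc = wins / (len(pos) * len(neg))
    return accuracy, recall, auc


# Toy data, invented for this example only.
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(evaluate_scores(toy_scores, toy_labels))
```

Accuracy and recall depend on the chosen cutoff, whereas the AUC summarises the ranking of genes across all cutoffs, which is why the three numbers can differ considerably for the same set of scores.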

Acknowledgements

We thank the providers and maintainers of the public databases used in this study. We appreciate the support of the Shenzhen Key Laboratory of Neurogenomics (CXB201108250094A). This study was supported by the National Natural Science Foundation of China (81771444) and the Shenzhen Municipal Government of China (No. JCYJ20170412153248372).

Author information

Corresponding authors

Correspondence to Xiuqing Zhang or Jianguo Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 18 kb)

439_2019_2021_MOESM2_ESM.pdf

Supplementary material 2 (PDF 312 kb) Supplementary Fig. 1 GO analysis of the predicted pathogenic and non-pathogenic genes of Mendelian diseases. GO pathway analysis for the reported_M, predicted_M and predicted_NM sets. The X-axis represents the GO pathways, and the Y-axis represents the percentage of genes in each pathway. The three gene sets are compared for each pathway. In general, the reported_M and predicted_M sets differ little from each other, while the predicted_NM set differs more from both. Some opposite results are also observed.

439_2019_2021_MOESM3_ESM.png

Supplementary material 3 (PNG 233 kb) Supplementary Fig. 2 Distribution of predicted and reported Mendelian disease (MD) genes on each chromosome. The outer circle (black) indicates all protein-coding genes on each chromosome. The middle circle (red) indicates the reported MD genes. The inner circle (blue) indicates the predicted MD genes.

439_2019_2021_MOESM4_ESM.pdf

Supplementary material 4 (PDF 126 kb) Supplementary Fig. 3 ROC curves of the GDP and GRP scores and the distributions of these scores among Mendelian disease (MD) genes. A. ROC curve of the GDP score. The AUC is 0.81. B. The distribution of the GDP score of MD genes indicates that fewer genes are dominant (the cutoff is 0.5). C. ROC curve of the GRP score. The AUC is 0.81. D. The distribution of the GRP score of MD genes indicates that more genes are recessive (the cutoff is 0.5).

Supplementary material 5 (XLSX 9 kb)

Supplementary material 6 (XLSX 19 kb)

Supplementary material 7 (XLSX 213 kb)

Supplementary material 8 (XLSX 712 kb)

Supplementary material 9 (XLSX 10 kb)

About this article

Cite this article

He, S., Chen, W., Liu, H. et al. Gene pathogenicity prediction of Mendelian diseases via the random forest algorithm. Hum Genet 138, 673–679 (2019). https://doi.org/10.1007/s00439-019-02021-9

  • DOI: https://doi.org/10.1007/s00439-019-02021-9