Gene pathogenicity prediction of Mendelian diseases via the random forest algorithm

He, Sijie; Chen, Weiwei; Liu, Hankui; Li, Shengting; Lei, Dongzhu; Dang, Xiao; Chen, Yulan; Zhang, Xiuqing; Zhang, Jianguo

doi:10.1007/s00439-019-02021-9

Original Investigation
Published: 08 May 2019

Volume 138, pages 673–679 (2019)

Human Genetics

Sijie He (ORCID: orcid.org/0000-0002-4418-1785)^1,2,3,4,na1,
Weiwei Chen^1,2,na1,
Hankui Liu^2,3,4,
Shengting Li^2,
Dongzhu Lei^5,
Xiao Dang^2,3,4,
Yulan Chen^2,3,4,
Xiuqing Zhang^2,4 &
Jianguo Zhang^2,3,4

Abstract

Studying Mendelian diseases and identifying their causative genes are of great significance in genetics. How to evaluate the pathogenicity of a gene, and how many Mendelian disease genes exist in total, are both important open questions. Very few studies have addressed these questions to date, so we attempt to answer them here. We calculated a gene pathogenicity prediction (GPP) score using a machine learning approach (the random forest algorithm) to evaluate the pathogenicity of genes. Applied to the testing gene set, the GPP score achieved an accuracy of 80%, a recall of 93% and an area under the curve of 0.87. Our results estimate that a total of 10,384 protein-coding genes are Mendelian disease genes. Furthermore, we found that the GPP score was positively correlated with disease severity. Our results indicate that the GPP score may provide a robust and reliable guideline for predicting the pathogenicity of protein-coding genes. To our knowledge, this is the first attempt to estimate the total number of Mendelian disease genes.
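
The abstract reports accuracy, recall and area under the curve (AUC) for the GPP score on a testing gene set. As a minimal sketch of how such metrics can be computed from per-gene scores, the Python below uses only the standard library. It does not train a random forest and is not the authors' pipeline: the function names, the 0.5 cutoff and the toy labels and scores are all illustrative assumptions.

```python
def accuracy_recall(labels, scores, cutoff=0.5):
    """Accuracy and recall when genes scoring >= cutoff are called pathogenic.

    labels: 1 for a known disease gene, 0 otherwise.
    cutoff: hypothetical decision threshold, not taken from the paper.
    """
    preds = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    accuracy = (tp + tn) / len(labels)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy, made-up data for illustration only (not from the study).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

acc, rec = accuracy_recall(toy_labels, toy_scores)
roc_auc = auc(toy_labels, toy_scores)
print(f"accuracy={acc:.3f} recall={rec:.3f} auc={roc_auc:.3f}")
```

Accuracy and recall depend on the chosen cutoff, whereas AUC summarizes ranking quality across all cutoffs, which is why a score can show high recall at one threshold alongside a more moderate accuracy.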

Acknowledgements

We thank the providers and maintainers of the public databases used in our study. We appreciate the support of the Shenzhen Key Laboratory of Neurogenomics (CXB201108250094A). This study was supported by the National Natural Science Foundation of China (81771444) and the Shenzhen Municipal Government of China (No. JCYJ20170412153248372).

Author information

Sijie He and Weiwei Chen contributed equally to the work.

Authors and Affiliations

BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, 518083, China
Sijie He & Weiwei Chen
BGI-Shenzhen, Shenzhen, 518083, China
Sijie He, Weiwei Chen, Hankui Liu, Shengting Li, Xiao Dang, Yulan Chen, Xiuqing Zhang & Jianguo Zhang
Shenzhen Key Laboratory of Neurogenomics, BGI-Shenzhen, Shenzhen, 518083, China
Sijie He, Hankui Liu, Xiao Dang, Yulan Chen & Jianguo Zhang
China National GeneBank, BGI-Shenzhen, Shenzhen, 518120, China
Sijie He, Hankui Liu, Xiao Dang, Yulan Chen, Xiuqing Zhang & Jianguo Zhang
Center of Prenatal Diagnosis, ChenZhou No. 1 People’s Hospital, Hunan, 423000, China
Dongzhu Lei

Corresponding authors

Correspondence to Xiuqing Zhang or Jianguo Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 18 kb)

439_2019_2021_MOESM2_ESM.pdf

Supplementary material 2 (PDF 312 kb) Supplementary Fig. 1 GO analysis of the predicted pathogenic and non-pathogenic genes of Mendelian diseases. GO pathway analysis for the reported_M, predicted_M and predicted_NM sets. The X-axis represents the GO pathways, and the Y-axis represents the percentage of genes in each pathway. The three gene sets are compared for each pathway. In general, the reported_M and predicted_M sets show little difference, while the predicted_NM set shows larger differences from both. Some opposite results are also observed.

439_2019_2021_MOESM3_ESM.png

Supplementary material 3 (PNG 233 kb) Supplementary Fig. 2 Distribution of predicted and reported Mendelian disease genes on each chromosome. The outer circle (black) indicates all protein-coding genes on each chromosome. The middle circle (red) indicates the reported MD genes. The inner circle (blue) indicates the predicted MD genes.

439_2019_2021_MOESM4_ESM.pdf

Supplementary material 4 (PDF 126 kb) Supplementary Fig. 3 ROC curves of the GDP and GRP scores and the distributions of these scores for Mendelian disease genes. A. ROC curve of the GDP score. The AUC is 0.81. B. The distribution of the GDP score of MD genes indicates that fewer genes are dominant (the cutoff is 0.5). C. ROC curve of the GRP score. The AUC is 0.81. D. The distribution of the GRP score of MD genes indicates that more genes are recessive (the cutoff is 0.5).

Supplementary material 5 (XLSX 9 kb)

Supplementary material 6 (XLSX 19 kb)

Supplementary material 7 (XLSX 213 kb)

Supplementary material 8 (XLSX 712 kb)

Supplementary material 9 (XLSX 10 kb)

About this article

Cite this article

He, S., Chen, W., Liu, H. et al. Gene pathogenicity prediction of Mendelian diseases via the random forest algorithm. Hum Genet 138, 673–679 (2019). https://doi.org/10.1007/s00439-019-02021-9

Received: 29 October 2018
Accepted: 19 April 2019
Published: 08 May 2019
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s00439-019-02021-9
