Skip to main content

Advertisement

Log in

Machine learning techniques for pathogenicity prediction of non-synonymous single nucleotide polymorphisms in human body

  • Original Research
  • Published:
Journal of Ambient Intelligence and Humanized Computing Aims and scope Submit manuscript

A Correction to this article was published on 22 January 2022

This article has been updated

Abstract

The rapidly growing human genetic variation data resulted from high-throughput genotyping and sequencing methods has motivated bioinformaticians to find solutions to deal with this huge amount of data. Single Nucleotide Polymorphism (SNP) is considered one of the important reasons for human genome variability and is involved in many human diseases, such as cancer. The non-synonymous SNP (nsSNP) mutation in the coding region causes amino acid changes which may produce protein functional alterations that may affect cell proliferation and thus it is the reason for many diseases. In this research, a framework is proposed to distinguish disease-causing missense variants from neutral mutations, as its task is pathogenicity prediction. The framework contains two main components which are an attributes selection tool and a classifier based on machine learning techniques (MLTs). The attributes selection tool works conjunctionally with the classifier to select the set of attributes that can classify the required dataset with the highest possible accuracy and a minimum set of attributes. The attributes selection tool is based on the swarm intelligence optimization approach which can optimize the set of selected attributes. Artificial neural network is the adopted MLT in this research, while decision tree and K-nearest Neighbor are investigated for the purpose of comparison. Benchmark datasets were used to evaluate the proposed framework yielding promising findings and results. The results of the proposed framework outperformed other methods on the same benchmark datasets. The best artificial neural network models achieved accuracies of 82%, 96%, 76.3%, for NovelVar, VaribenchSelectedPure and SwissvarFilteredMix respectively. For more validation, the three datasets were combined to form one large dataset, and the achieved accuracy was 82.9%.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16

Similar content being viewed by others

Change history

References

  • Adzhubei IA et al (2010) A method and server for predicting damaging missense mutations. Nat Methods 7(4):248–249

    Article  Google Scholar 

  • Alomari A et al (2018) Swarm intelligence optimization techniques for obstacle-avoidance mobility-assisted localization in wireless sensor networks. IEEE Access 6:22368–22385

    Article  Google Scholar 

  • Frazer KA et al (2009) Human genetic variation and its contribution to complex traits. Nat Rev Genet 10(4):241–251

    Article  Google Scholar 

  • Fredman D et al (2004) HGVbase: a curated resource describing human DNA variation and phenotype relationships. Nucleic Acids Res 32(suppl_1):516–519

    Article  Google Scholar 

  • González-Pérez A, López-Bigas N (2011) Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet 88(4):440–449

    Article  Google Scholar 

  • Grimm DG et al (2015) The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum Mutat 36(5):513–523

    Article  Google Scholar 

  • Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam

    MATH  Google Scholar 

  • Hecht M, Bromberg Y, Rost B (2015) Better prediction of functional effects for sequence variants. BMC Genomics 16(S8):S1

    Article  Google Scholar 

  • El Houby EM, Yassin NI, Omran S (2017) A hybrid approach from ant Colony optimization and K-nearest neighbor for classifying datasets using selected features. Informatica 41(4)

  • Isakov O, Dotan I, Ben-Shachar S (2017) Machine learning–based gene prioritization identifies novel candidate risk genes for inflammatory bowel disease. Inflamm Bowel Dis 23(9):1516–1523

    Article  Google Scholar 

  • Joseph PV et al (2018) A computational framework for predicting obesity risk based on optimizing and integrating genetic risk score and gene expression profiles. PLoS ONE 13(5):e0197843

    Article  Google Scholar 

  • Kircher M et al (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46(3):310–315

    Article  Google Scholar 

  • Li M-X et al (2013) Predicting mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies. PLoS Genet 9(1):e1003143

    Article  Google Scholar 

  • Li G et al (2020) Application of deep canonically correlated sparse autoencoder for the classification of schizophrenia. Comput Methods Programs Biomed 183:105073

    Article  Google Scholar 

  • López B et al (2018) Single Nucleotide Polymorphism relevance learning with Random Forests for Type 2 diabetes risk prediction. Artif Intell Med 85:43–49

    Article  Google Scholar 

  • Montaez CAC et al (2018) Deep learning classification of polygenic obesity using genome wide association study SNPs. In: 2018 International Joint Conference on Neural Networks (IJCNN). IEEE

  • Neagoe V-E, Neghina E-C (2016) Feature selection with ant colony optimization and its applications for pattern recognition in space imagery. In: 2016 international conference on communications (COMM). IEEE

  • Ng PC, Henikoff S (2001) Predicting deleterious amino acid substitutions. Genome Res 11(5):863–874

    Article  Google Scholar 

  • Nithya P, ChandraSekar A (2019) NBN gene analysis and it’s impact on breast cancer. J Med Syst 43(8):270

    Article  Google Scholar 

  • Ranganathan Ganakammal S, Alexov E (2020) An ensemble approach to predict the pathogenicity of synonymous variants. Genes 11(9):1102

    Article  Google Scholar 

  • Reva B, Antipin Y, Sander C (2011) Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 39(17):e118–e118

    Article  Google Scholar 

  • Sachidanandam R et al (2001) A map of human genome sequence variation containing 142 million single nucleotide polymorphisms. Nature 409(6822):928–934

    Article  Google Scholar 

  • Schwarz JM et al (2014) MutationTaster2: mutation prediction for the deep-sequencing age. Nat Methods 11(4):361–362

    Article  Google Scholar 

  • Sherry ST et al (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308–311

    Article  Google Scholar 

  • Shihab HA et al (2013) Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum Mutat 34(1):57–65

    Article  Google Scholar 

  • Sim N-L et al (2012) SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res 40(W1):W452–W457

    Article  Google Scholar 

  • Sreeja N, Sankar A (2015) Pattern matching based classification using ant colony optimization based feature selection. Appl Soft Comput 31:91–102

    Article  Google Scholar 

  • Thusberg J, Vihinen M (2009) Pathogenic or not? And if so, then how? Studying the effects of missense mutations using bioinformatics methods. Hum Mutat 30(5):703–714

    Article  Google Scholar 

  • Wan Y et al (2016) A feature selection method based on modified binary coded ant colony optimization algorithm. Appl Soft Comput 49:248–258

    Article  Google Scholar 

  • Wang M, Wei L (2016) iFish: predicting the pathogenicity of human nonsynonymous variants using gene-specific/family-specific attributes and classifiers. Sci Rep 6(1):1–10

    MathSciNet  Google Scholar 

  • Xu Y et al (2020) Prediction of smoking behavior from single nucleotide polymorphisms with machine learning approaches. Front Psych 11:416

    Article  Google Scholar 

  • Ye Z-Q et al (2007) Finding new structural and sequence attributes to predict possible disease association of single amino acid polymorphism (SAP). Bioinformatics 23(12):1444–1450

    Article  Google Scholar 

  • Yip YL et al (2004) The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat 23(5):464–470

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Enas M. F. El Houby.

Ethics declarations

Conflict of interest

The authors have no conflict of interests to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: “The missed pseudocode image has been added in the original article”.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

El Houby, E.M.F. Machine learning techniques for pathogenicity prediction of non-synonymous single nucleotide polymorphisms in human body. J Ambient Intell Human Comput 14, 8099–8113 (2023). https://doi.org/10.1007/s12652-021-03581-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s12652-021-03581-3

Keywords

Navigation