Molecular Genetics and Genomics, Volume 290, Issue 1, pp 343–352

Discriminating between deleterious and neutral non-frameshifting indels based on protein interaction networks and hybrid properties

  • Ning Zhang
  • Tao Huang
  • Yu-Dong Cai
Original Paper


Abstract

Each human genome contains more than ten thousand coding variants, yet our knowledge of how genetic variants underlie phenotypic differences is far from complete. Small insertions and deletions (indels) are among the most common types of human genetic variants and play a significant role in human inherited disease, but we still lack a comprehensive understanding of how indels cause disease. Identifying and analyzing deleterious indels is therefore a key challenge of great current interest in genome biology. A growing number of computational methods have been developed to discriminate between deleterious and neutral indels. However, most existing methods rely on traditional sequential or structural features, which cannot fully explain the association between indels and the inherited diseases they induce. In this study, we establish a novel method to predict deleterious non-frameshifting indels based on features extracted from both protein interaction networks and traditional hybrid properties. Each indel was encoded by 1,246 features. Using the maximum relevance minimum redundancy (mRMR) method and the incremental feature selection (IFS) method, we obtained an optimal feature set of 42 features, 21 of which were derived from protein interaction networks. Based on this optimal feature set, a Random Forest classifier achieved an accuracy of 88% and a Matthews correlation coefficient (MCC) of 0.76, as evaluated by the jackknife cross-validation test. This method outperformed existing methods for predicting deleterious indels and can be applied in practice to predict deleterious non-frameshifting indels in genome research. Analysis of the optimal features revealed that network interactions play a more important role, and could be more informative for illustrating an indel’s function and disease associations, than traditional sequential or structural features. These results could shed some light on the genetic basis of human genetic variation and human inherited diseases.
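
The accuracy and MCC reported above are both derived from the four counts of a binary confusion matrix. The following is a minimal sketch of those two measures, assuming deleterious indels are treated as the positive class. The function names and the example counts are illustrative assumptions and are not taken from the study.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of samples classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal sum is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    counts = dict(tp=90, tn=85, fp=15, fn=10)
    print("accuracy:", accuracy(**counts))
    print("MCC:", mcc(**counts))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when one class is predicted much better than the other.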


Keywords

Indel; Disease; Network feature; Random forest; Incremental feature selection



Acknowledgments

This work was supported by grants from the National Basic Research Program of China (2011CB510102, 2011CB510101), the National Natural Science Foundation of China (61401302, 31371335, 81171342, 81201148), the Tianjin Research Program of the Application Foundation and Advanced Technology (14JCQNJC09500), the Innovation Program of the Shanghai Municipal Education Commission (12ZZ087), the National Research Foundation for the Doctoral Program of Higher Education of China (20130032120070, 20120032120073), the grant of “The First-class Discipline of Universities in Shanghai”, and the Seed Foundation of Tianjin University (60302064, 60302069).

Conflict of interest

The authors declare that they have no conflict of interest.

Supplementary material

438_2014_922_MOESM1_ESM.txt (24.7 mb)
Supplementary material 1 (TXT 25,247 kb). Online Resource S1. The dataset used in this study, comprising 2,479 deleterious samples (labeled 1) and 2,413 neutral samples (labeled 2). The first five columns are annotations: (1) the protein name; (2) the mutation region; (3) the original and mutated sequences, separated by “/”; (4) the mutation type, deletion or insertion (del/ins); and (5) the effect of the mutation, deleterious (1) or neutral (2). The features of each mutation start at the sixth column.
438_2014_922_MOESM2_ESM.xls (122 kb)
Supplementary material 2 (XLS 122 kb). Online Resource S2. The mRMR table. The 1,246 features were ranked by mRMR scores. The top 42 features form the optimal feature set as determined by IFS
438_2014_922_MOESM3_ESM.xls (170 kb)
Supplementary material 3 (XLS 170 kb). Online Resource S3. The IFS results. Each classifier was constructed by adding one more feature from the mRMR table in Online Resource S2. The prediction performance of every classifier is listed. The best performer is the classifier constructed using the top 42 features.
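
The IFS procedure summarized in Online Resources S2 and S3 (rank the features, grow the feature set one feature at a time, keep the best-scoring prefix) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `evaluate` is a hypothetical callback standing in for whatever performance estimate is used (in the study, a Random Forest assessed by the jackknife test), and the toy feature names and scoring rule are invented purely for demonstration.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float, List[float]]:
    """Score the top-1, top-2, ..., top-N ranked features and keep the best prefix.

    ranked   -- feature names ordered from most to least important
    evaluate -- hypothetical callback returning a score (higher is better)
    Ties are resolved in favour of the smaller feature set.
    """
    best_k, best_score = 0, float("-inf")
    curve: List[float] = []
    for k in range(1, len(ranked) + 1):
        score = evaluate(ranked[:k])
        curve.append(score)
        if score > best_score:
            best_k, best_score = k, score
    return list(ranked[:best_k]), best_score, curve


if __name__ == "__main__":
    # Toy stand-in for a classifier score: reward informative features,
    # penalize set size. Names and weights are illustrative only.
    informative = {"f1", "f2", "f3"}

    def toy_score(subset: Sequence[str]) -> float:
        return len(informative & set(subset)) - 0.1 * len(subset)

    ranking = ["f1", "f2", "x1", "f3", "x2"]
    selected, score, curve = incremental_feature_selection(ranking, toy_score)
    print(selected, score)
```

Because each candidate set is a prefix of the ranking, only N classifiers are evaluated for N features, rather than one per possible subset.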



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Institute of Systems Biology, Shanghai University, Shanghai, People’s Republic of China
  2. Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, People’s Republic of China
  3. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, USA
