Discriminating between deleterious and neutral non-frameshifting indels based on protein interaction networks and hybrid properties

  • Original Paper
Molecular Genetics and Genomics

Abstract

Each human genome contains more than ten thousand coding variants; however, our knowledge of how genetic variants underlie phenotypic differences is far from complete. Small insertions and deletions (indels) are among the most common types of human genetic variants, and they play a significant role in human inherited disease. To date, we still lack a comprehensive understanding of how indels cause disease. Therefore, the identification and analysis of such deleterious variants is a key challenge and is of great interest in current genome biology research. An increasing number of computational methods have been developed for discriminating between deleterious and neutral indels. However, most of the existing methods are based on traditional sequential or structural features, which cannot completely explain the association between indels and the inherited diseases they induce. In this study, we establish a novel method to predict deleterious non-frameshifting indels based on features extracted from both protein interaction networks and traditional hybrid properties. Each indel was coded by 1,246 features. Using the maximum relevance minimum redundancy (mRMR) method and the incremental feature selection (IFS) method, we obtained an optimal feature set containing 42 features, of which 21 were derived from protein interaction networks. Based on the optimal feature set, a Random Forest achieved an accuracy of 88% and an MCC of 0.76, as evaluated by the Jackknife cross-validation test. This method outperformed existing methods of predicting deleterious indels and can be applied in practice to predict deleterious non-frameshifting indels in genome research. Analysis of the optimal features selected in the model revealed that network interactions play a more important role than traditional sequential or structural features and could be more informative for illustrating an indel’s function and disease associations. These results could shed some light on the genetic basis of human genetic variations and human inherited diseases.
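
The two performance figures quoted in the abstract, accuracy and MCC (Matthews correlation coefficient), are both functions of the binary confusion matrix. The following is a minimal Python sketch of the standard definitions, assuming deleterious indels are treated as the positive class. The function names are illustrative, and the counts in the usage note are invented for demonstration; they are not results from the study.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Returns 0.0 when any marginal sum is zero, a common convention
    (an assumption here, not something the abstract specifies).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Made-up counts: 8 true positives, 7 true negatives,
    # 3 false positives, 2 false negatives.
    print(accuracy(8, 7, 3, 2))
    print(mcc(8, 7, 3, 2))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the two classes are of unequal size.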

Acknowledgments

This work was supported by grants from the National Basic Research Program of China (2011CB510102, 2011CB510101), the National Natural Science Foundation of China (61401302, 31371335, 81171342, 81201148), the Tianjin Research Program of the Application Foundation and Advanced Technology (14JCQNJC09500), the Innovation Program of the Shanghai Municipal Education Commission (12ZZ087), the National Research Foundation for the Doctoral Program of Higher Education of China (20130032120070, 20120032120073), the grant of “The First-class Discipline of Universities in Shanghai” and the Seed Foundation of Tianjin University (60302064, 60302069).

Conflict of interest

The authors declare that they have no conflict of interest.

Author information

Corresponding authors

Correspondence to Tao Huang or Yu-Dong Cai.

Additional information

Communicated by S. Hohmann.

Electronic supplementary material

Below is the link to the electronic supplementary material.

438_2014_922_MOESM1_ESM.txt

Supplementary material 1 (TXT 25,247 kb). Online Resource S1. The dataset used in this study. There are 2,479 deleterious samples (denoted as 1) and 2,413 neutral ones (denoted as 2). The first five columns are annotations: the first column is the protein name, the second column is the mutation region, the third column is the original and mutated sequences (separated by “/”), the fourth column is the mutation type, either deletion or insertion (del/ins), and the fifth column is the effect of the mutation, either deleterious (1) or neutral (2). The features of each mutation start from the sixth column.
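
The column layout described for Online Resource S1 can be read with a small parser. The sketch below is written under stated assumptions: the description does not give the field delimiter, so a tab is assumed; the mutation region is kept as an opaque string because its exact format is not described; and the `IndelRecord` class, the `parse_record` function, and the sample line are hypothetical, not taken from the file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndelRecord:
    protein: str          # column 1: protein name
    region: str           # column 2: mutation region (format not specified)
    original_seq: str     # column 3, left of "/"
    mutated_seq: str      # column 3, right of "/"
    mutation_type: str    # column 4: "del" or "ins"
    deleterious: bool     # column 5: 1 = deleterious, 2 = neutral
    features: List[float] # column 6 onward


def parse_record(line: str, sep: str = "\t") -> IndelRecord:
    """Parse one line of the dataset (delimiter is an assumption)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) < 6:
        raise ValueError("expected 5 annotation columns plus features")
    protein, region, seqs, mtype, label = fields[:5]
    original, slash, mutated = seqs.partition("/")
    if not slash:
        raise ValueError("sequence column must contain '/'")
    if mtype not in ("del", "ins"):
        raise ValueError(f"unknown mutation type: {mtype!r}")
    if label not in ("1", "2"):
        raise ValueError(f"unknown label: {label!r}")
    return IndelRecord(
        protein=protein,
        region=region,
        original_seq=original,
        mutated_seq=mutated,
        mutation_type=mtype,
        deleterious=(label == "1"),
        features=[float(x) for x in fields[5:]],
    )


if __name__ == "__main__":
    # Hypothetical line; real records carry 1,246 feature values.
    print(parse_record("PROT_X\t10-12\tAKL/A\tdel\t1\t0.5\t1.25"))
```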

438_2014_922_MOESM2_ESM.xls

Supplementary material 2 (XLS 122 kb). Online Resource S2. The mRMR table. The 1,246 features were ranked by mRMR scores. The top 42 features form the optimal feature set as determined by IFS.

438_2014_922_MOESM3_ESM.xls

Supplementary material 3 (XLS 170 kb). Online Resource S3. The IFS results. Each classifier was constructed by adding one more feature from the mRMR table in Online Resource S2. The prediction performance of every classifier is listed. The best performer is the classifier constructed using the top 42 features.
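
The incremental feature selection procedure summarized for Online Resource S3 can be sketched generically in Python: starting from a ranked feature list, a classifier is evaluated on the top 1, top 2, ..., top N features, and the best-scoring subset size is kept. In this sketch the `evaluate` callback is a hypothetical stand-in for training and scoring a classifier (in the study, a Random Forest assessed by the Jackknife test); the function name, the tie-breaking rule (smallest subset wins), and the toy scores are assumptions for illustration only.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Tuple[int, List[float]]:
    """Return (best_k, scores), where scores[k-1] is the score of the
    classifier built on the top-k ranked features."""
    if not ranked_features:
        raise ValueError("ranked_features must not be empty")
    scores = [
        evaluate(ranked_features[:k])
        for k in range(1, len(ranked_features) + 1)
    ]
    # max() returns the first maximum, so ties go to the smaller subset.
    best_k = max(range(1, len(scores) + 1), key=lambda k: scores[k - 1])
    return best_k, scores


if __name__ == "__main__":
    ranked = ["f1", "f2", "f3", "f4"]            # hypothetical mRMR order
    toy_scores = {1: 0.6, 2: 0.8, 3: 0.8, 4: 0.7}  # invented values
    best_k, scores = incremental_feature_selection(
        ranked, lambda subset: toy_scores[len(subset)]
    )
    print(best_k, scores)
```

Because every subset is a prefix of the ranked list, only N classifiers are evaluated rather than all possible feature subsets, which is what makes the search tractable for 1,246 features.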

Cite this article

Zhang, N., Huang, T. & Cai, YD. Discriminating between deleterious and neutral non-frameshifting indels based on protein interaction networks and hybrid properties. Mol Genet Genomics 290, 343–352 (2015). https://doi.org/10.1007/s00438-014-0922-5

  • DOI: https://doi.org/10.1007/s00438-014-0922-5
