Performance measures in evaluating machine learning based bioinformatics predictors for classifications

Jiao, Yasen; Du, Pufeng

doi:10.1007/s40484-016-0081-2

Review
Published: 23 December 2016

Volume 4, pages 320–330, (2016)
Quantitative Biology

Yasen Jiao¹ & Pufeng Du¹

Abstract

Background

Many existing bioinformatics predictors are based on machine learning. Before these predictors are applied in practical studies, their predictive performance should be well understood. However, different studies use different performance measures and different evaluation methods. Even the same performance measure may appear under different terms, nomenclatures, or notations in different contexts.

Results

We reviewed the most commonly used performance measures and evaluation methods for bioinformatics predictors.

Conclusions

Correctly understanding and interpreting predictor performance is important in bioinformatics, as it is the key to rigorously comparing different predictors and choosing the right one.

Author information

Authors and Affiliations

School of Computer Science and Technology, Tianjin University, Tianjin, 300350, China
Yasen Jiao & Pufeng Du

Corresponding author

Correspondence to Pufeng Du.

About this article

Cite this article

Jiao, Y., Du, P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quant Biol 4, 320–330 (2016). https://doi.org/10.1007/s40484-016-0081-2

Received: 08 June 2016
Revised: 06 September 2016
Accepted: 21 October 2016
Published: 23 December 2016
Issue Date: December 2016
DOI: https://doi.org/10.1007/s40484-016-0081-2
