Abstract
Proteomics is the study of proteins in cells, tissues, and organisms, with the aim of understanding their structures, functions, and interactions. A central task in proteomics is protein family prediction, which identifies evolutionary relationships between proteins from similarities in their sequences or structures. It has applications in areas such as drug discovery and the functional annotation of genomes. Current methods for protein family prediction, however, suffer from limited accuracy, high false positive rates, and difficulty handling large datasets. Some methods also rely on homologous sequences or protein structures, which introduces bias and restricts their applicability to specific protein families or structures. To overcome these limitations, researchers have turned to machine learning (ML) approaches, which can identify connections between protein features and simplify complex high-dimensional datasets. This paper surveys articles that apply ML techniques to protein family prediction. The primary objective is to explore these techniques and ways to improve them, thereby supporting future research in the field. Qualitative and quantitative analyses show that many methods, using a range of classifiers, have been applied to protein family prediction. However, little work has focused on developing novel classifiers for protein family classification, which highlights the need for improved approaches in this area. By addressing these challenges, this research aims to enhance the accuracy and effectiveness of protein family prediction and, ultimately, to facilitate advances in proteomics and its applications.
Data Availability
Data will be made available on request.
Abbreviations
- AUC: Area Under the Curve
- AUPR: Area Under the Precision-Recall Curve
- BLAST: Basic Local Alignment Search Tool
- DT: Decision Tree
- ELM: Extreme Learning Machine
- FN: False Negatives
- FP: False Positives
- FPR: False Positive Rate
- GO: Gene Ontology
- GPCR: G-Protein Coupled Receptor
- GRU: Gated Recurrent Unit
- HMMs: Hidden Markov Models
- KNN: K-Nearest Neighbor
- MCC: Matthews Correlation Coefficient
- ML: Machine Learning
- MLP: Multilayer Perceptron
- NB: Naive Bayes
- NetGO: Network information based GO
- NMR: Nuclear Magnetic Resonance
- PPI: Protein-Protein Interaction
- PseAAC: Pseudo Amino Acid Composition
- RF: Random Forest
- SVM: Support Vector Machine
- TN: True Negatives
- TP: True Positives
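Several of the abbreviations above (TP, TN, FP, FN, FPR, MCC) name quantities derived from a binary confusion matrix. As a minimal sketch of how they relate, using the standard textbook definitions rather than anything specific to the surveyed articles, the snippet below computes FPR and MCC from the four counts. The function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def false_positive_rate(tn: int, fp: int) -> float:
    """FPR = FP / (FP + TN): the share of true negatives wrongly flagged positive."""
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for one protein family treated as the positive class.
tp, tn, fp, fn = 90, 80, 20, 10

fpr = false_positive_rate(tn, fp)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"FPR = {fpr:.3f}, MCC = {mcc:.3f}")
```

MCC uses all four counts, so it stays informative when classes are imbalanced, which is common when one protein family is compared against all others.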
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
T. Idhaya wrote and drafted the manuscript. A. Suruliandi supervised the work. S.P. Raja carried out the implementation. All authors are aware of the submission, and all authors read, reviewed, and agreed on the final version of the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
We declare that there is no conflict of interest.
Ethical Approval
Not applicable. This research did not involve humans or animals.
Consent for Publication
Not applicable.
Research Involving Human and Animal Participants
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent
Not applicable.
About this article
Cite this article
Idhaya, T., Suruliandi, A. & Raja, S.P. A Comprehensive Review on Machine Learning Techniques for Protein Family Prediction. Protein J (2024). https://doi.org/10.1007/s10930-024-10181-5
DOI: https://doi.org/10.1007/s10930-024-10181-5