A Comprehensive Review on Machine Learning Techniques for Protein Family Prediction

Idhaya, T.; Suruliandi, A.; Raja, S. P.

doi:10.1007/s10930-024-10181-5

Published: 01 March 2024

The Protein Journal (2024)

T. Idhaya¹,
A. Suruliandi¹ &
S. P. Raja²

Abstract

Proteomics is a field dedicated to the analysis of proteins in cells, tissues, and organisms, aiming to gain insights into their structures, functions, and interactions. A crucial aspect within proteomics is protein family prediction, which involves identifying evolutionary relationships between proteins by examining similarities in their sequences or structures. This approach holds great potential for applications such as drug discovery and functional annotation of genomes. However, current methods for protein family prediction have certain limitations, including limited accuracy, high false positive rates, and challenges in handling large datasets. Some methods also rely on homologous sequences or protein structures, which introduce biases and restrict their applicability to specific protein families or structures. To overcome these limitations, researchers have turned to machine learning (ML) approaches that can identify connections between protein features and simplify complex high-dimensional datasets. This paper presents a comprehensive survey of articles that employ various ML techniques for predicting protein families. The primary objective is to explore and improve ML techniques specifically for protein family prediction, thus advancing future research in the field. Through qualitative and quantitative analyses of ML techniques, it is evident that multiple methods utilizing a range of classifiers have been applied for protein family prediction. However, there has been limited focus on developing novel classifiers for protein family classification, highlighting the urgent need for improved approaches in this area. By addressing these challenges, this research aims to enhance the accuracy and effectiveness of protein family prediction, ultimately facilitating advancements in proteomics and its diverse applications.
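To make the idea of linking protein features to families concrete, the following is a minimal sketch rather than any method taken from the surveyed articles. It assumes amino acid composition as the sequence-derived feature and a one-nearest-neighbour rule as the classifier. The function names, the toy sequences, and the family labels are hypothetical and serve only as illustration.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the 20-dimensional vector of residue frequencies.

    Characters outside the 20 standard residues are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def predict_family(query, labelled_sequences):
    """Assign the label of the training sequence whose composition
    vector is closest (Euclidean distance) to that of the query."""
    query_vec = amino_acid_composition(query)
    nearest = min(
        labelled_sequences,
        key=lambda item: math.dist(query_vec, amino_acid_composition(item[0])),
    )
    return nearest[1]


# Hypothetical toy data: made-up sequences and labels, not real families.
training = [
    ("AAAAGGGALLAA", "family_A"),
    ("KKKRRKEEKKRD", "family_B"),
]
print(predict_family("AAGGALAAGA", training))
```

Composition vectors discard residue order, so a realistic pipeline would likely combine them with richer descriptors and a stronger classifier.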

Data Availability

Data will be made available on request.

Abbreviations

AUC: Area Under Curve
AUPR: Area Under Precision-Recall Curve
BLAST: Basic Local Alignment Search Tool
DT: Decision Tree
ELM: Extreme Learning Machines
FN: False Negatives
FP: False Positives
FPR: False Positive Rate
GO: Gene Ontology
GPCR: G-Protein-Coupled Receptors
GRU: Gated Recurrent Unit
HMMs: Hidden Markov Models
KNN: K-Nearest Neighbor
MCC: Matthews Correlation Coefficient
ML: Machine Learning
MLP: Multilayer Perceptron
NB: Naive Bayes
NetGO: Network information-based GO
NMR: Nuclear Magnetic Resonance
PPI: Protein-Protein Interaction
PseAAC: Pseudo Amino Acid Composition
RF: Random Forest
SVM: Support Vector Machine
TN: True Negatives
TP: True Positives
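
Several of the abbreviations above (TP, TN, FP, FN, FPR, MCC) name quantities derived from a binary confusion matrix. The sketch below shows the standard definitions of the false positive rate and the Matthews correlation coefficient. The function names and the example counts are hypothetical, chosen only to exercise the formulas, and do not come from the article.

```python
import math


def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN); returns 0.0 when there are no negatives."""
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, not results from any study.
tp, tn, fp, fn = 40, 45, 5, 10
fpr = false_positive_rate(fp, tn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"FPR = {fpr:.3f}, MCC = {mcc:.3f}")
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it is often preferred to accuracy when protein families are of unequal size.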

Funding

Not applicable.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India
T. Idhaya & A. Suruliandi
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India
S. P. Raja

Contributions

T. Idhaya wrote and drafted the manuscript. A. Suruliandi supervised the work. S. P. Raja carried out the implementation. All authors are aware of the submission, and all authors read, reviewed, and agreed on the final version of the manuscript.

Corresponding author

Correspondence to T. Idhaya.

Ethics declarations

Conflict of interest

We declare that there is no conflict of interest.

Ethical Approval

Not applicable. This research does not involve humans or animals.

Consent for Publication

Not applicable.

Research Involving Human and Animal Participants

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Not applicable.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Idhaya, T., Suruliandi, A. & Raja, S.P. A Comprehensive Review on Machine Learning Techniques for Protein Family Prediction. Protein J (2024). https://doi.org/10.1007/s10930-024-10181-5

Accepted: 19 January 2024
Published: 01 March 2024
DOI: https://doi.org/10.1007/s10930-024-10181-5
