A Comprehensive Review on Machine Learning Techniques for Protein Family Prediction

The Protein Journal

Abstract

Proteomics is the study of proteins in cells, tissues, and organisms, with the aim of understanding their structures, functions, and interactions. A crucial task within proteomics is protein family prediction, which identifies evolutionary relationships between proteins by examining similarities in their sequences or structures. This approach holds great potential for applications such as drug discovery and functional annotation of genomes. However, current methods for protein family prediction have limitations, including limited accuracy, high false positive rates, and difficulty handling large datasets. Some methods also rely on homologous sequences or protein structures, which introduces bias and restricts their applicability to specific protein families or structures. To overcome these limitations, researchers have turned to machine learning (ML) approaches that can identify connections between protein features and simplify complex high-dimensional datasets. This paper presents a comprehensive survey of articles that employ ML techniques for predicting protein families. Its primary objective is to explore and improve ML techniques specifically for protein family prediction, thus advancing future research in the field. Qualitative and quantitative analyses of these techniques show that many methods, using a range of classifiers, have been applied to protein family prediction. However, little attention has been paid to developing novel classifiers for protein family classification, which highlights the need for improved approaches in this area. By addressing these challenges, this research aims to enhance the accuracy and effectiveness of protein family prediction, ultimately facilitating advances in proteomics and its diverse applications.

Data Availability

Data will be made available on request.

Abbreviations

AUC:

Area Under Curve

AUPR:

Area Under the Precision-Recall Curve

BLAST:

Basic Local Alignment Search Tool

DT:

Decision Tree

ELM:

Extreme Learning Machines

FN:

False Negatives

FP:

False Positives

FPR:

False Positive Rate

GO:

Gene Ontology

GPCR:

G-Protein-Coupled Receptors

GRU:

Gated Recurrent Unit

HMMs:

Hidden Markov Models

KNN:

K-Nearest Neighbor

MCC:

Matthews Correlation Coefficient

ML:

Machine Learning

MLP:

Multilayer Perceptron

NB:

Naive Bayes

NetGO:

Network information based GO

NMR:

Nuclear Magnetic Resonance

PPI:

Protein-Protein Interaction

PseAAC:

Pseudo Amino Acid Composition

RF:

Random Forest

SVM:

Support Vector Machine

TN:

True Negatives

TP:

True Positives
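
The abbreviations TP, TN, FP, FN, FPR, and MCC listed above are standard confusion-matrix quantities for a binary classifier. As a minimal sketch of how they relate (the function names and the example counts are hypothetical and are not taken from the article), FPR and MCC can be computed from the four counts as follows:

```python
import math


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN); returns 0.0 when there are no actual negatives."""
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a single "family vs. rest" prediction task.
tp, tn, fp, fn = 90, 80, 20, 10

fpr = false_positive_rate(fp, tn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"FPR = {fpr:.3f}, MCC = {mcc:.3f}")
```

MCC ranges from -1 to 1 and uses all four counts, which is one reason it is often reported alongside accuracy when classes are imbalanced.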

Funding

Not applicable.

Author information

Contributions

T. Idhaya wrote and drafted the manuscript. A. Suruliandi supervised the work. S.P. Raja carried out the implementation. All authors are aware of the submission, and all authors read, reviewed, and agreed on the final version of the manuscript.

Corresponding author

Correspondence to T. Idhaya.

Ethics declarations

Conflict of interest

We declare that there is no conflict of interest.

Ethical Approval

Not applicable. This research did not involve humans or animals.

Consent for Publication

Not applicable.

Research Involving Human and Animal Participants

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Not applicable.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Idhaya, T., Suruliandi, A. & Raja, S.P. A Comprehensive Review on Machine Learning Techniques for Protein Family Prediction. Protein J (2024). https://doi.org/10.1007/s10930-024-10181-5

  • DOI: https://doi.org/10.1007/s10930-024-10181-5
