
Development and validation of multiple machine learning algorithms for the classification of G-protein-coupled receptors using molecular evolution model-based feature extraction strategy

  • Original Article
  • Amino Acids

Abstract

Machine learning is one of the most promising approaches to predicting the function of the rapidly growing, large-scale collection of G-protein-coupled receptors (GPCRs). Prior research shows that the key to overall GPCR classification accuracy is extracting valuable features and filtering out redundant ones. To build a more efficient classification model, we take the feature synonym problem into account and propose a new method based on functional word clustering and integration. Candidate features are clustered into synonym groups by evaluating the evolutionary correlation between them, using the transition scores in established molecular substitution matrices. Each group of clustered features is then integrated and represented by a unique key functional word. The retained key functional words form a feature knowledge base. Before the training and testing stages, the original GPCR sequences are converted into feature vectors through a feature re-extraction strategy based on the features in the knowledge base. We build multiple machine learning models based on the Naïve Bayesian (NB), random forest (RF), support vector machine (SVM), and multi-layer perceptron (MLP) algorithms. The established models are applied to classify two public data sets containing 8354 and 12,731 GPCRs, respectively. In comparison with state-of-the-art methods, these models perform strongly on almost all evaluation criteria. This work demonstrates the potential of the novel feature extraction strategy and provides an effective theoretical design for the hierarchical classification of GPCRs.
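The pipeline described in the abstract (cluster candidate features into synonym groups by substitution score, keep one key functional word per group, then re-extract feature vectors) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of fixed-length n-grams as candidate features, the greedy clustering rule, the threshold, and the toy substitution scores are all assumptions made for illustration, and every function name below is hypothetical. A real run would load a full PAM or BLOSUM matrix.

```python
from collections import Counter

# Toy substitution scores (hypothetical values for illustration only;
# a real run would load a full PAM or BLOSUM matrix instead).
MATCH, MISMATCH = 4, -2
SIMILAR = {frozenset("IL"): 2, frozenset("IV"): 3, frozenset("LV"): 1,
           frozenset("KR"): 2, frozenset("DE"): 2}

def sub_score(x, y):
    """Score for aligning residue x with residue y."""
    if x == y:
        return MATCH
    return SIMILAR.get(frozenset((x, y)), MISMATCH)

def word_score(a, b):
    """Position-wise sum of substitution scores for two equal-length words."""
    return sum(sub_score(x, y) for x, y in zip(a, b))

def ngrams(seq, n):
    """All overlapping length-n words of a sequence, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def cluster_synonyms(words, threshold):
    """Greedy clustering (an assumed rule): a word joins the first group
    whose key word it scores >= threshold against, else starts a new group."""
    groups = {}  # key functional word -> list of synonym members
    for w in sorted(set(words)):
        for key in groups:
            if word_score(key, w) >= threshold:
                groups[key].append(w)
                break
        else:
            groups[w] = [w]
    return groups

def to_vector(seq, groups, n):
    """Re-extract features: count, per key word, occurrences of any synonym."""
    word_to_key = {w: k for k, members in groups.items() for w in members}
    counts = Counter(word_to_key[w] for w in ngrams(seq, n) if w in word_to_key)
    return [counts[k] for k in groups]

# Build a small knowledge base from two toy sequences, then vectorize.
train = ["KLIV", "RLVV"]
candidates = [w for s in train for w in ngrams(s, 3)]
knowledge_base = cluster_synonyms(candidates, threshold=8)
vectors = [to_vector(s, knowledge_base, 3) for s in train]
```

In this toy setup the four candidate words collapse into two key functional words, so each sequence is represented by a two-dimensional count vector. Vectors of this form could then be passed to any of the classifiers named in the abstract.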


Data availability

The source code and the experimental data sets are freely available at https://gitee.com/wei_xiaolin/gpcr-project.git.

Abbreviations

GPCR:

G-protein-coupled receptors

SVM:

Support vector machine

KNN:

K-nearest neighbor

RF:

Random forest

GA:

Genetic algorithm

CFS:

Correlation feature selection

MSA:

Multiple sequence alignments

MUSCLE:

Multiple sequence comparison by log-expectation

PAM:

Point accepted mutation

BLOSUM:

Blocks substitution matrix

NB:

Naïve Bayesian

MLP:

Multi-layer perceptron

NN:

Neural network


Acknowledgements

Not applicable.

Funding

This work was funded by grants from the Natural Science Foundation of China [Grant No: 61602026 to C., L.] and the Scientific Research Fund of Provincial Universities [Grant No: 2021J016 to H.Y., Z.].

Author information


Corresponding author

Correspondence to Haoyu Zhang.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Research involving human participants and/or animals

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Handling editor: F. Polticelli.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOC 74 kb)


About this article


Cite this article

Ling, C., Wei, X., Shen, Y. et al. Development and validation of multiple machine learning algorithms for the classification of G-protein-coupled receptors using molecular evolution model-based feature extraction strategy. Amino Acids 53, 1705–1714 (2021). https://doi.org/10.1007/s00726-021-03080-x


  • DOI: https://doi.org/10.1007/s00726-021-03080-x
