Skip to main content
Log in

Classification of G-protein coupled receptors based on a rich generation of convolutional neural network, N-gram transformation and multiple sequence alignments

  • Original Article
  • Published:
Amino Acids Aims and scope Submit manuscript

Abstract

Sequence classification is crucial in predicting the function of newly discovered sequences. In recent years, the prediction of the incremental large-scale and diversity of sequences has heavily relied on the involvement of machine-learning algorithms. To improve prediction accuracy, these algorithms must confront the key challenge of extracting valuable features. In this work, we propose a feature-enhanced protein classification approach, considering the rich generation of multiple sequence alignment algorithms, N-gram probabilistic language model and the deep learning technique. The essence behind the proposed method is that if each group of sequences can be represented by one feature sequence, composed of homologous sites, there should be less loss when the sequence is rebuilt, when a more relevant sequence is added to the group. On the basis of this consideration, the prediction becomes whether a query sequence belonging to a group of sequences can be transferred to calculate the probability that the new feature sequence evolves from the original one. The proposed work focuses on the hierarchical classification of G-protein Coupled Receptors (GPCRs), which begins by extracting the feature sequences from the multiple sequence alignment results of the GPCRs sub-subfamilies. The N-gram model is then applied to construct the input vectors. Finally, these vectors are imported into a convolutional neural network to make a prediction. The experimental results elucidate that the proposed method provides significant performance improvements. The classification error rate of the proposed method is reduced by at least 4.67% (family level I) and 5.75% (family Level II), in comparison with the current state-of-the-art methods. The implementation program of the proposed work is freely available at: https://github.com/alanFchina/CNN.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Similar content being viewed by others

References

  • Adams MD, Celniker SE, Holt RA et al (2000) The genome sequence of Drosophila melanogaster. Science 287:2185–2195

    Article  PubMed  Google Scholar 

  • Altschul SF (1991) Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 219:555–565

    Article  CAS  PubMed  Google Scholar 

  • Altschul SF, Madden TL, Schaffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res 25:3389–3402

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  • Bairoch A, Boeckmann B, Ferro S et al (2004) Swiss-Prot: juggling between evolution and stability. Brief Bioinform 5:39–55

    Article  CAS  PubMed  Google Scholar 

  • Bandyopadhyay S (2005) An efficient technique for superfamily classification of amino acid sequences: feature extraction, fuzzy clustering and prototype selection. Fuzzy Sets Syst 152(1):5–16

    Article  Google Scholar 

  • Bhasin M, Raghava GPS (2004) GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucl Acids Res 32:383–389

    Article  Google Scholar 

  • Boeckmann B et al (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL. Nucl Acids Res 31:365–370

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  • Boutet E, Lieberherr D, Tognolli M et al (2016) UniProtKB/Swiss-Prot, the manually annotated section of the UniProt knowledge base: how to use the entry view. Methods Mol Biol 1374:23–54

    Article  CAS  PubMed  Google Scholar 

  • Brown PF, Desouza PV, Mercer RL et al (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–479

    Google Scholar 

  • Chambers G, Lawrie L, Cash P et al (2000) Proteomics: a new approach to the study of disease. J Pathol 192:280–288

    Article  CAS  PubMed  Google Scholar 

  • Cheng BYM, Carbonell JG, Klein-Seetharaman J et al (2005) Protein classification based on text document classification techniques. Proteins Struct Funct Bioinform 58(4):955–970

    Article  CAS  Google Scholar 

  • Daugaard M, Rohde M, Jäättelä M (2007) The heat shock protein 70 family: highly homologous proteins with overlapping and distinct functions. FEBS Lett 581(19):3702–3710

    Article  CAS  PubMed  Google Scholar 

  • Davies MN, Secker A, Halling-Brown M et al (2008) Gpcrtree: online hierarchical classification of GPCR function. BMC Res Notes 1(1):67

    Article  PubMed  PubMed Central  Google Scholar 

  • Dayhoff MO, Schwartz R, Orcutt BC (1978) A model of evolutionary change in proteins. In: Davidoff MO (ed) Atlas of protein sequence and structure. National Biomedical Research Foundation, Silver Spring (MD), pp 345–352

  • Dongardive J, Abraham S (2016) Protein sequence classification based on n-gram and k-nearest neighbor algorithm. Comput Intell Data Min 2:163–171

    Google Scholar 

  • Durbin R, Eddy S, Krogh A et al (1998) Biological sequence analysis. Cambridge University Press, Cambridge

    Book  Google Scholar 

  • Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl Acids Res 32(5):1792–1797

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  • Firdaus MA, Razib MO (2009) Analysis of multiple alignment on the performance of classification algorithm for remote protein homology detection. Genes Dev 22(24):3489–3496

    Google Scholar 

  • George SR, O’Dowd BF, Lee SP (2002) G-protein-coupled receptor oligomerization and its potential for drug discovery. Nat Rev Drug Discov 1(10):808–820

    Article  CAS  PubMed  Google Scholar 

  • Gether U (2000) Uncovering molecular mechanisms involved in activation of G protein-coupled receptors. Endocr Rev 21(1):90–113

    Article  CAS  PubMed  Google Scholar 

  • Gosele C, Hong L, Kreitler T et al (2000) High-throughput scanning of the rat genome using interspersed repetitive sequence-PCR markers. Genomics 69:287–294

    Article  CAS  PubMed  Google Scholar 

  • Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  • Henikoff S, Henikoff JG, Alford WJ, Pietrokovski S (1995) Automated construction and graphical presentation of protein blocks from unaligned sequences. Gene 163(2):17–26

    Article  Google Scholar 

  • Iqbal MJ, Faye I, Said AM et al (2014) Data mining of protein sequences with amino acid position-based feature encoding technique. In: Proceedings of the first international conference on advanced data and information engineering. Singapore, pp 119–126

  • Isberg V, Vroling B, Rob VDK et al (1998) GPCRDB: An information system for G protein-coupled receptors. Nucl Acid Res 26(1):275–279

    Article  Google Scholar 

  • Jeong JC, Lin X, Chen X (2011) On position-specific scoring matrix for protein function prediction. IEEE/ACM Trans Comput Biol Bioinform 8(2):308–315

    Article  PubMed  Google Scholar 

  • Kalchbrenner N, Grefenstette E, Blunsom PA (2014) A convolutional neural network for modelling sentences. In: Proceedings of the 52nd annual meeting

  • Kamal NAM, Bakar AA, Zainudin S (2015) Filter-wrapper approach to feature selection of GPCR protein[C]. In: International conference on electrical engineering and informatics. IEEE, pp 693–698

  • Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of the 26th annual conference on neural information processing systems, NIPS 2012, vol 2, pp 1097–1105

  • Li Z, Zhou X, Dai Z et al (2010) Classification of G-protein coupled receptors based on support vector machine with maximum relevance minimum redundancy and genetic algorithm. BMC Bioinform 11(1):325

    Article  Google Scholar 

  • Li M, Ling C, Gao J (2017) An efficient CNN-based classification on G-protein coupled receptors using TF-IDF and N-gram. In: Proceedings of 2017 IEEE symposium on computers and communications (ISCC), Heraklion, pp 924–931

  • Lynch M (2002) Intron evolution as a population-genetic process. Proc Natl Acad Sci USA 99:6118–6123

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  • Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, London, p 521

    Google Scholar 

  • Naveed M, Khan AU (2012) GPCR-MPredictor: multi-level prediction of G protein-coupled receptors using genetic ensemble. Amino Acids 42:1809–1823

    Article  CAS  PubMed  Google Scholar 

  • Notredame C, Higgins DG, Heringa J (1996) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205–217

    Article  Google Scholar 

  • Pearson WR (1996) Effective protein sequence comparison. Methods Enzymol 266:227–258

    Article  CAS  PubMed  Google Scholar 

  • Pearson WR (1998) Empirical statistical estimates for sequence similarity searches. J Mol Biol 276:71–84

    Article  CAS  PubMed  Google Scholar 

  • Ramos J (2003) Using tf-idf to determine word relevance in document queries. In: Proceedings of the 1st instructional conference on machine learning

  • Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. Comput Sci

  • Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl Acids Res 22(22):4673–4680

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  • Thornton JM (2001) From genome to function. Science 292:2095–2097

    Article  CAS  PubMed  Google Scholar 

  • Vinga S, Almeida J (2003) Alignment-free sequence comparison: a review. Bioinformatics 19(4):513–523

    Article  CAS  PubMed  Google Scholar 

  • Vries JK, Munshi R, Tobi D et al (2004) A sequence alignment-independent method for protein classification. Appl Bioinform 3(2):137

    Article  CAS  Google Scholar 

  • Waterston RH, Lindblad-Toh K, Birney E et al (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562

    Article  CAS  PubMed  Google Scholar 

  • Wu CH, Huang H, Yeh LL et al (2003) Protein family classification and functional annotation. Comput Biol Chem 27:37–47

    Article  CAS  PubMed  Google Scholar 

  • Yao X (1999) Evolving artificial neural networks. Proc IEEE 87(9):1423–1447

    Article  Google Scholar 

  • Zavaljevski N, Stevens FJ, Reifman J (2002) Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions. Bioinformatics 18(5):689–696

    Article  CAS  PubMed  Google Scholar 

  • Zhang YX, Perry K, Vinci VA et al (2002) Genome shuffling leads to rapid phenotypic improvement in bacteria. Nature 415:644–646

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

Funding was provided by National Natural Science Foundation of China (Grant no. 61602026).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Cheng Ling.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Research involving human participants and/or animals

The authors did not use human participants or animals in the investigation.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Handling Editor: I. Greger.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 33 kb)

Supplementary material 2 (DOCX 37 kb)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Li, M., Ling, C., Xu, Q. et al. Classification of G-protein coupled receptors based on a rich generation of convolutional neural network, N-gram transformation and multiple sequence alignments. Amino Acids 50, 255–266 (2018). https://doi.org/10.1007/s00726-017-2512-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00726-017-2512-4

Keywords

Navigation