Classification of G-protein coupled receptors based on a rich generation of convolutional neural network, N-gram transformation and multiple sequence alignments

Li, Man; Ling, Cheng; Xu, Qi; Gao, Jingyang

doi:10.1007/s00726-017-2512-4

Original Article
Published: 18 November 2017

Amino Acids, volume 50, pages 255–266 (2018)

Man Li¹,
Cheng Ling ORCID: orcid.org/0000-0003-3827-849X¹,
Qi Xu¹ &
…
Jingyang Gao¹

Abstract

Sequence classification is crucial for predicting the function of newly discovered sequences. In recent years, as sequence collections have grown steadily in scale and diversity, such prediction has come to rely heavily on machine-learning algorithms. To improve prediction accuracy, these algorithms must confront the key challenge of extracting valuable features. In this work, we propose a feature-enhanced protein classification approach that draws on multiple sequence alignment algorithms, the N-gram probabilistic language model and deep learning. The idea behind the proposed method is that if each group of sequences can be represented by one feature sequence composed of homologous sites, then less should be lost when that feature sequence is rebuilt after a more relevant sequence is added to the group. On this basis, predicting whether a query sequence belongs to a group of sequences can be recast as calculating the probability that the new feature sequence evolves from the original one. The proposed work focuses on the hierarchical classification of G-protein coupled receptors (GPCRs). It begins by extracting the feature sequences from the multiple sequence alignment results of the GPCR sub-subfamilies. The N-gram model is then applied to construct the input vectors. Finally, these vectors are fed into a convolutional neural network to make a prediction. The experimental results show that the proposed method provides significant performance improvements: compared with the current state-of-the-art methods, its classification error rate is reduced by at least 4.67% (family level I) and 5.75% (family level II). The implementation of the proposed work is freely available at https://github.com/alanFchina/CNN.
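
The first two stages of the pipeline described in the abstract (a feature sequence per group, then N-gram input vectors) can be illustrated with a minimal Python sketch. The abstract does not say how the feature sequence is derived or how the N-gram values are weighted, so everything here is an assumption made for illustration: the majority-vote consensus over alignment columns, the relative-frequency bigram vector, and the names `consensus_sequence` and `ngram_vector` are hypothetical and are not taken from the authors' repository. The convolutional network stage is omitted.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order so vector positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def consensus_sequence(aligned, gap="-"):
    """Return the most common non-gap residue of each alignment column.

    Assumption: a simple majority vote stands in for the paper's
    feature-sequence extraction. Columns made only of gaps are skipped.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    residues_out = []
    for column in zip(*aligned):
        residues = [c for c in column if c != gap]
        if residues:
            residues_out.append(Counter(residues).most_common(1)[0][0])
    return "".join(residues_out)


def ngram_vector(sequence, n=2, alphabet=AMINO_ACIDS):
    """Return the relative frequency of every possible n-gram over `alphabet`.

    The vector has len(alphabet) ** n entries in lexicographic order.
    Frequencies are relative to the number of n-gram windows in `sequence`.
    """
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    vocabulary = ("".join(p) for p in product(alphabet, repeat=n))
    return [counts[g] / total if total else 0.0 for g in vocabulary]


if __name__ == "__main__":
    # Toy alignment of one hypothetical group; "-" marks a gap.
    group = ["MK-LV", "MKALV", "MR-LV"]
    feature = consensus_sequence(group)
    vector = ngram_vector(feature, n=2)
    print(feature, len(vector))
```

In a setup like this, each group would contribute one fixed-length vector per feature sequence, which is the kind of input a convolutional network could consume; the fixed vocabulary order is what keeps vectors from different groups comparable.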

Acknowledgements

Funding was provided by National Natural Science Foundation of China (Grant no. 61602026).

Author information

Authors and Affiliations

Department of Computer Science and Technology, College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
Man Li, Cheng Ling, Qi Xu & Jingyang Gao

Corresponding author

Correspondence to Cheng Ling.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Research involving human participants and/or animals

The authors did not use human participants or animals in the investigation.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Handling Editor: I. Greger.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 33 kb)

Supplementary material 2 (DOCX 37 kb)

About this article

Cite this article

Li, M., Ling, C., Xu, Q. et al. Classification of G-protein coupled receptors based on a rich generation of convolutional neural network, N-gram transformation and multiple sequence alignments. Amino Acids 50, 255–266 (2018). https://doi.org/10.1007/s00726-017-2512-4

Received: 24 June 2017
Accepted: 14 November 2017
Published: 18 November 2017
Issue Date: February 2018
DOI: https://doi.org/10.1007/s00726-017-2512-4
