
Identifying anticancer peptides by using a generalized chaos game representation

Journal of Mathematical Biology

Abstract

We generalize chaos game representation (CGR) to higher-dimensional spaces while preserving its bijectivity, so that the method remains sufficiently representative and mathematically rigorous compared with previous attempts. We first state and prove an asymptotic property of CGR and of our generalized chaos game representation (GCGR). It follows that the dissimilarity between sequences that share an identical subsequence at different positions decreases exponentially with the length of that subsequence; this effect has been at work without researchers being aware of it. By drawing attention to it, we show that the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques: GCGR-Centroid and GCGR-Variance. We use GCGR-Centroid to analyze the similarity between protein sequences on three datasets: 9 ND5, 24 TF and 50 beta-globin proteins. The results are consistent with previous studies, which supports the significance of the method. Finally, using support vector machines, we train an anticancer peptide prediction model on both GCGR-Centroid and GCGR-Variance features and achieve significantly higher prediction performance on three well-studied anticancer peptide datasets.
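The exponential effect described in the abstract can be illustrated with the classic two-dimensional CGR for nucleotide sequences. The sketch below is not the GCGR construction of the paper, whose definition is not given on this page; it assumes the standard midpoint iteration, a commonly used assignment of A, C, G, T to the corners of the unit square, and hypothetical helper names (`cgr_points`, `final_point_distance`). Under those assumptions, two sequences that end in the same subsequence of length k, reached after prefixes of different lengths, have final CGR points whose distance halves with every additional shared symbol.

```python
from math import dist

# Assumed corner layout for the classic 2D CGR on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR trajectory of seq.

    Each point is the midpoint between the previous point and the
    corner assigned to the current symbol.
    """
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def final_point_distance(seq_a, seq_b):
    """Euclidean distance between the last CGR points of two sequences."""
    return dist(cgr_points(seq_a)[-1], cgr_points(seq_b)[-1])


if __name__ == "__main__":
    shared = "GATTACA"
    # The shared block starts at position 5 in one sequence and 7 in the other.
    for k in range(1, len(shared) + 1):
        suffix = shared[-k:]
        d = final_point_distance("AAAA" + suffix, "GGGGGG" + suffix)
        print(f"shared length {k}: distance {d:.6f}")
```

Because every step contracts the difference between two trajectories by a factor of 1/2, the distance after k shared symbols is the distance before the shared block divided by 2^k, which is the kind of exponential decay the abstract refers to.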





Acknowledgements

We gratefully acknowledge the anonymous reviewers who read our paper and provided constructive comments. This work is supported by the National Natural Science Foundation of China (No. 61877064). Matthias Dehmer thanks the Austrian Science Funds for supporting this work (project P26142).

Author information


Corresponding author

Correspondence to Yusen Zhang.

Additional information

The authors sincerely thank the referees for their valuable and highly constructive comments that have significantly improved this paper. This study was supported by the Shandong Natural Science Foundation (Grant No. ZR2015AM017). Matthias Dehmer thanks the Austrian Science Funds for supporting this work (Project P26142).

Appendix

See Tables 7, 8, and 9.

Table 7 The information for 9 ND5 protein sequences
Table 8 The concise information for 24 TF protein sequences
Table 9 50 beta-globin sequences of animal species


About this article


Cite this article

Ge, L., Liu, J., Zhang, Y. et al. Identifying anticancer peptides by using a generalized chaos game representation. J. Math. Biol. 78, 441–463 (2019). https://doi.org/10.1007/s00285-018-1279-x


  • DOI: https://doi.org/10.1007/s00285-018-1279-x
