Evolutionary mechanism and biological functions of 8-mers containing CG dinucleotide in yeast

  • Original Article
Chromosome Research

Abstract

The non-random usage of k-mers and its biological functions deserve special attention. First, we studied human 8-mer spectra: when the 8-mers were classified into three subsets under each of the 16 dinucleotide classifications, only the spectra of the cytosine-guanine (CG) dinucleotide classification formed independent unimodal distributions. Second, the same distribution rules were reproduced in seven other species, including yeast, which indicates that the phenomenon is universal across species. We therefore proposed two theoretical conjectures: (1) CG1 motifs (8-mers containing one CG) are nucleosome-binding motifs; (2) CG2 motifs (8-mers containing two or more CGs) are the modular units of CpG islands. These conjectures were confirmed in yeast by the following results: when nucleosome core sequences and linker sequences were distinguished using the three CG subsets, the highest average area under the receiver operating characteristic curve (AUC) was obtained from CG1 information; there was a one-to-one relationship between regions of abundant CG1 signal and histone positions; sequence changes in squeezed nucleosomes were related to the strength of CG1 signals; and an AUC of 0.986 was obtained from CG2 information when CpG islands and non-CpG islands were distinguished using the three CG subsets.
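The partition of 8-mers into CG0, CG1, and CG2 subsets described above can be illustrated with a minimal Python sketch. The function names (`count_kmers`, `cg_subset`, `subset_spectra`) are hypothetical, and the sketch assumes that a spectrum means the number of distinct 8-mers observed at each occurrence count; the article's own definition and normalisation may differ, and 8-mers that never occur in the sequence are not counted here.

```python
from collections import Counter


def count_kmers(seq, k=8):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def cg_subset(kmer, dinuc="CG"):
    """Return 0, 1, or 2 for the CG0, CG1, and CG2 subsets.

    CG2 collects every k-mer with two or more occurrences of the dinucleotide.
    """
    n = sum(1 for i in range(len(kmer) - 1) if kmer[i:i + 2] == dinuc)
    return min(n, 2)


def subset_spectra(counts):
    """Assumed spectrum: occurrence count -> number of distinct k-mers, per subset."""
    spectra = {0: Counter(), 1: Counter(), 2: Counter()}
    for kmer, c in counts.items():
        spectra[cg_subset(kmer)][c] += 1
    return spectra
```

Calling `subset_spectra(count_kmers(genome))` on a genome string yields three histograms, one per subset, whose shapes could then be compared for unimodality.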

Abbreviations

CGI: CpG island

CGi: The CG0, CG1, and CG2 subsets

NCS: Nucleosome core sequence

NLS: Nucleosome linker sequence

Non-CGI: Non-CpG island

NSRE: New symmetric relative entropy

Par: Characteristic parameter

RF: Relative frequency

RMN: Relative motif number

Sn: Sensitivity

Sp: Specificity

XY0: 8-mers without XY dinucleotide

XY1: 8-mers including 1 XY dinucleotide

XY2: 8-mers including 2 or more XY dinucleotides
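The evaluation measures named in this article (Sn, Sp, and the AUC reported in the abstract) have standard definitions, sketched below in Python. The article's own formulas are not shown on this page, so this is an illustration under the usual conventions: Sn = TP / (TP + FN), Sp = TN / (TN + FP), and AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function names are hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Standard Sn and Sp from confusion-matrix counts."""
    sn = tp / (tp + fn)  # fraction of true positives recovered
    sp = tn / (tn + fp)  # fraction of true negatives recovered
    return sn, sp


def auc(pos_scores, neg_scores):
    """Rank-based AUC: P(positive score > negative score), ties count 0.5.

    This pairwise form is quadratic in the sample sizes and is meant
    only to make the definition explicit.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, scores for nucleosome core sequences could be passed as `pos_scores` and scores for linker sequences as `neg_scores`; which sequence class is treated as positive is an assumption of this sketch.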

Acknowledgements

The authors thank Prof. Liaofu Luo for illuminating discussions. This work was supported by the National Natural Science Foundation of China (No. 31260219).

Author information

Corresponding author

Correspondence to Hong Li.

Additional information

Responsible Editor: Tatsuo Fukagawa

Electronic supplementary material

ESM 1

(DOC 207 kb)

About this article

Cite this article

Zheng, Y., Li, H., Wang, Y. et al. Evolutionary mechanism and biological functions of 8-mers containing CG dinucleotide in yeast. Chromosome Res 25, 173–189 (2017). https://doi.org/10.1007/s10577-017-9554-z

  • DOI: https://doi.org/10.1007/s10577-017-9554-z
