Employment of Transcription Start Site Data
Although CpG islands are often associated with gene promoters (Bird 1987), more than half are likely located in non-promoter regions in the human genome (Takai and Jones 2003). Using data for transcription start sites (TSSs), we first identified CpG islands that functioned as promoters. We assumed that CpG-island promoters bear at least one TSS in their genomic sequences. The TSS of each gene can be readily obtained as the first position of the curated cDNA sequence registered in NCBI RefSeq (Pruitt et al. 2014). For comparison, we also evaluated the DataBase of Transcriptional Start Sites (DBTSS), in which each site is supported by experimentally determined positions of 5ʹ cap structures (Wakaguri et al. 2008). Taking base-resolution positions and transcriptional directions into account, 40-nt promoter sequences, each comprising a TSS with its 34-nt upstream and 5-nt downstream flanking regions, were excised from the human reference sequence GRCh37. The datasets, consisting of 60,120 and 16,911 sequences from RefSeq and DBTSS, respectively, were prepared independent of the presence of CpG islands. In contrast to the sequence logo obtained from the former set, the latter clearly exhibited the initiator (Inr) motif YANW (Fig. 1) (Juven-Gershon et al. 2008). Here, degenerate nucleotides are indicated according to the IUPAC code. The pyrimidine-purine (YR) consensus was particularly prominent when the DBTSS data were used to excise the 40-nt promoter sequences. The A or R nucleotide in the sequence motifs represents the transcription start point. Since the RefSeq-based alignment showed neither the Inr motif nor the YR dinucleotide consensus, it is unlikely that the 5ʹ ends of many RefSeq entries represent bona fide TSSs. Hence, we employed DBTSS to determine the positions and directions of promoters, as well as to decide whether a CpG island functioned as a promoter.
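The strand-aware excision described above can be illustrated with a minimal Python sketch. The function names, the 1-based forward-strand coordinate convention, and the in-memory chromosome string are assumptions made for illustration; the original pipeline is not described at this level of detail.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def excise_promoter(chrom_seq: str, tss: int, strand: str,
                    upstream: int = 34, downstream: int = 5) -> str:
    """Excise a window of upstream + 1 + downstream nt around a TSS.

    Assumption: `tss` is a 1-based coordinate on the forward strand and
    `strand` is "+" or "-". The returned sequence is always oriented in the
    direction of transcription, so the TSS sits at index `upstream`.
    """
    i = tss - 1  # convert to a 0-based index
    if strand == "+":
        start, end = i - upstream, i + downstream + 1
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher forward coordinates.
        start, end = i - downstream, i + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    if start < 0 or end > len(chrom_seq):
        raise ValueError("window extends beyond the sequence")
    window = chrom_seq[start:end]
    return window if strand == "+" else reverse_complement(window)
```

With the default arguments the function returns a 40-nt sequence; calling it with `upstream=300, downstream=300` would yield a 601-nt window under the same conventions.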
Divergence of Human and Mouse Promoter Sequences
Generally, one gene locus has many TSSs, which form a cluster around a core promoter. Considering frequency of usage and genomic position, a single TSS is selected as a representative for each core promoter in DBTSS (Wakaguri et al. 2008). For each human representative TSS, the flanking 300-nt sequences from both the upstream and downstream regions were excised from the reference sequence, taking into account the transcriptional direction, to form a set of 601-nt promoter sequences, ultimately generating 20,612 promoter sequences. Of these sequences, 16,911 were associated with the RefSeq accession prefix NM, i.e., curated models of protein-coding genes. Using these 16,911 promoter sequences as queries, we searched the mouse reference sequence GRCm38 for orthologous promoters that are relatively conserved between the two species. BLAT using default settings listed 8601 hits. We then eliminated the sequences with a mismatch at the YR consensus dinucleotide between humans and mice. After this screening, 2739 promoter sequences remained, a number comparable to the 3197 reported in a previous study (Jiang et al. 2007). Of these, 336 predicted TSSs coincided with the mouse data deposited in DBTSS. This approach missed most mouse promoters, suggesting a high degree of sequence divergence between the human and mouse promoters. Although a small number had been anticipated, it was too small for comparative genome analyses. We therefore selected the rhesus monkey, Macaca mulatta, as an alternative model organism, as it is expected to show moderate sequence conservation and moderate divergence relative to the human genome (Yan et al. 2011; Zimin et al. 2014). While 5833 transcripts have been released as UCSC refGene for the macaque, 16,662 hits and 12,598 highly probable promoters were obtained by the BLAT alignment.
In addition to human data, sequence logos of mouse and macaque data were drawn (Fig. 1). For the latter two species, TSSs were inferred from human representative TSSs compiled in DBTSS. Clear YR consensus sequences were obtained, suggesting that this method effectively predicts TSSs for organisms without experimentally determined TSSs. The promoter sequences, however, should be conserved to some extent, e.g., desirably to the degree seen between humans and macaque.
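As an illustration of the YR screening step, the following Python sketch checks the dinucleotide spanning positions −1 and +1. It assumes that both sequences are already oriented in the direction of transcription and that the TSS occupies the same index in each (for example, index 300 in a 601-nt sequence with no indel upstream of the TSS). The text does not specify exactly how a "mismatch" was scored, so the rule used here, that both species must show a pyrimidine followed by a purine, is an assumption.

```python
PYRIMIDINES = set("CT")
PURINES = set("AG")


def has_yr_consensus(seq: str, tss_index: int) -> bool:
    """True if position -1 is a pyrimidine and position +1 (the TSS) a purine.

    `tss_index` is the 0-based index of the TSS within `seq`.
    """
    if tss_index < 1 or tss_index >= len(seq):
        raise ValueError("tss_index must have a base on its upstream side")
    return (seq[tss_index - 1].upper() in PYRIMIDINES
            and seq[tss_index].upper() in PURINES)


def yr_conserved(human_seq: str, other_seq: str, tss_index: int) -> bool:
    """Hypothetical screening rule: keep a pair only if both sequences
    show the YR dinucleotide at the (shared) TSS index."""
    return (has_yr_consensus(human_seq, tss_index)
            and has_yr_consensus(other_seq, tss_index))
```

Pairs for which `yr_conserved` returns `False` would be discarded under this interpretation.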
The upstream regions from the TSSs appeared to be G + C-rich for the three species, although CpG-island promoters were not selected in this analysis (Fig. 1). Interestingly, from position −25 to −29, enrichment of G and C was lost for all species. Because the position corresponds to that of the TATA box (Juven-Gershon et al. 2008), we searched the upstream regions for its consensus sequence (Fig. 2). To avoid TA dinucleotide repeats, the 6-nt sequence TATAAA was used as the query in this search. In the histogram showing the distribution of the consensus sequence, the most frequent position was −31, a peak that was clearly observed when DBTSS data were employed. Of the 13,772 human transcripts for which TSS data were available in both databases, the RefSeq and DBTSS positions coincided in only 312 genes. In more than 90% of the gene loci, RefSeq start positions were situated upstream of the corresponding DBTSS positions (Fig. 3). These results suggest that DBTSS is better suited than RefSeq to delineating promoter sequences. Therefore, we employed the DBTSS data for further analyses.
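The positional search for the TATA-box consensus can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: it assumes promoter sequences oriented in the direction of transcription with the TSS at a known 0-based index, and it reports the position of the first T of each TATAAA match relative to the TSS (the base immediately upstream of the TSS is −1). Which base of the motif the published histogram refers to is not stated, so that convention is an assumption.

```python
from collections import Counter


def motif_positions(seq: str, tss_index: int, motif: str = "TATAAA") -> list:
    """Return positions (relative to the TSS) of motif matches that start
    upstream of the TSS. A match starting 31 nt upstream is reported as -31."""
    seq = seq.upper()
    positions = []
    start = seq.find(motif)
    while start != -1 and start < tss_index:
        positions.append(start - tss_index)
        start = seq.find(motif, start + 1)  # allow overlapping matches
    return positions


def position_histogram(seqs, tss_index: int, motif: str = "TATAAA") -> Counter:
    """Count motif start positions over a collection of promoter sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(motif_positions(seq, tss_index, motif))
    return counts
```

For 601-nt promoter sequences built from 300-nt flanks, `tss_index` would be 300 under these conventions.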
Definition of a CpG-Island Promoter
Several definitions of a CpG island have been proposed (Gardiner-Garden and Frommer 1987; Takai and Jones 2002; Wu et al. 2010). While the existence of G + C-rich Alu sequences in primate genomes tends to complicate the definition, the role of Alu in gene expression is generally considered to be insignificant. Consistent with this view, Alu sequences have been shown to be absent in the vicinity of TSSs (Yamashita et al. 2005). Hence, we used 300-nt flanking sequences to form 601-nt promoter sequences, which fully covered the 500-bp core portion of a CpG island, and defined a sequence as a CpG-island promoter if its G + C content was greater than 0.5 and its CpG score was greater than 0.6. Here, the CpG score is the ratio of observed to expected CpG numbers, as in the UCSC database. Any 601-nt promoter sequence that failed to meet these criteria was deemed a non-CpG-island promoter.
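The two criteria can be expressed in a few lines of Python. The sketch below assumes the observed/expected formula commonly given for the UCSC CpG island track, namely (number of CpG × sequence length) / (number of C × number of G); the function names are illustrative, and the handling of ambiguous bases (simply counted in the length) is an assumption.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_score(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G).

    Returns 0.0 when the sequence contains no C or no G, since the
    expected count is then zero and no CpG can be observed.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def is_cpg_island_promoter(seq: str,
                           gc_cutoff: float = 0.5,
                           cpg_cutoff: float = 0.6) -> bool:
    """Apply the two thresholds stated in the text (both strict)."""
    return gc_content(seq) > gc_cutoff and cpg_score(seq) > cpg_cutoff
```

Both comparisons are strict ("greater than"), following the wording of the definition.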
Changes in G + C content and CpG scores between the human and macaque promoters are illustrated as a scatter plot (Fig. 4, Supplementary Fig. S1). Orthologous counterparts were connected by a single line, while vertical and horizontal lines were drawn to discern the two promoter types. CpG-island promoters formed an apparent cluster in the top right region, in which G + C content and CpG score were more than 0.5 and 0.6, respectively. In addition, the two lines partitioned the plot into four regions, namely, low-G + C/high-CpG (left top), high-G + C/high-CpG (right top), low-G + C/low-CpG (left bottom), and high-G + C/low-CpG (right bottom). The majority of promoter pairs were located in the second region, i.e., among the CpG-island promoters. Most of the others were located in the two bottom regions, i.e., among the low-CpG promoters. Low-G + C/high-CpG promoters were rare in both humans and macaque. Large numbers of changes were observed between high-G + C/high-CpG and high-G + C/low-CpG (right top–bottom), as well as between low-G + C/low-CpG and high-G + C/low-CpG (right–left bottom). Connecting lines that span a border between two promoter groups were counted and illustrated (Fig. 4). It is likely that G + C contents and CpG scores have drifted evolutionarily in either of the following two ways: changes in G + C content in a low-CpG genomic context, or changes in CpG content in a high-G + C genomic context. A gene ontology analysis revealed that such labile promoters were preferentially found in genes associated with alternative splicing (150 of 258 hCmN and 99 of 158 hNmC).
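Counting the connecting lines that cross a border amounts to assigning each promoter to one of the four regions and tallying pairs whose regions differ. The Python sketch below is a hypothetical illustration: the region labels, the input format (one pair of (G + C content, CpG score) tuples per orthologous pair), and the function names are assumptions, while the cutoffs of 0.5 and 0.6 are those given in the text.

```python
from collections import Counter

GC_CUTOFF = 0.5
CPG_CUTOFF = 0.6


def region(gc: float, cpg: float) -> str:
    """Assign a promoter to one of the four regions of the scatter plot."""
    gc_label = "high-GC" if gc > GC_CUTOFF else "low-GC"
    cpg_label = "high-CpG" if cpg > CPG_CUTOFF else "low-CpG"
    return f"{gc_label}/{cpg_label}"


def count_border_crossings(pairs) -> Counter:
    """Tally orthologous pairs whose two members fall in different regions.

    `pairs` is an iterable of ((human_gc, human_cpg), (macaque_gc, macaque_cpg)).
    Keys of the result are (human_region, macaque_region) tuples.
    """
    counts = Counter()
    for human, macaque in pairs:
        human_region, macaque_region = region(*human), region(*macaque)
        if human_region != macaque_region:
            counts[(human_region, macaque_region)] += 1
    return counts
```

Pairs in which exactly one member falls in the high-GC/high-CpG region correspond to a change of promoter type under the definition used here.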
The number of 601-nt promoter sequences detected in the macaque genome was 12,543, which were then realigned to the corresponding human queries with ClustalW 2.1 to examine base changes between the two species. Pairwise alignments with alignment scores less than 90 were discarded from further analysis. Additionally, those with a mismatch in the YR consensus dinucleotide were also discarded. More than 97% of promoter pairs between human and macaque preserved their promoter types, i.e., CpG-island or non-CpG-island promoters (Table 1). However, the TBCE gene represents one of the rare cases that showed a discrepancy of promoter types (Fig. 5). Although the orthologous promoter sequences were highly similar between the two species, a major difference was a 47-bp indel near the 3ʹ end. BLAT search for the 47-bp fragment using chimpanzee, gorilla, baboon, and marmoset reference sequences revealed that a deletion event may have occurred in a common ancestor of gorilla, chimpanzee, and humans. Compared to the macaque sequence, four CpG sites were lost by the deletion in the human TBCE promoter, resulting in transition from a CpG-island promoter to a non-CpG-island promoter in hominids, namely great apes.
CpG Mutation Spectrum
In vertebrate genomes, CpG sites are subject to cytosine methylation, often followed by deamination and mutation to TpG (or CpA if deamination occurred in the complementary strand). In the human genome, the frequency of the CpG dinucleotide was extremely low among all 16 dinucleotide sequences. Conversely, frequencies of resultant TpG and CpA were the highest, except for A-rich or T-rich ones, i.e., ApA, ApT, TpA, and TpT (Okamura et al. 2007). We then examined alteration rates of the 16 types of dinucleotides between the two species (Fig. 6) and found the most frequent alteration to be in the CpG dinucleotide aligned to the non-CpG-island promoter of the counter species. In contrast, CpG sites in CpG-island promoters were relatively conserved between the human and macaque genomes. Further, many alterations in TpA dinucleotides were observed if aligned to CpG-island promoters of the counter species.
As for CpG sites in CpG-island promoters, more than 80% of the dinucleotide sequences were conserved between the two species. In particular, 85.7% and 86.7% of the CpG sites in human and macaque CpG-island promoters, respectively, were aligned without mismatch (Fig. 6, Supplementary Table S2). The slight difference in these percentage values arose as the total number of CpG sites between the counterparts was not identical. However, dinucleotide conservation was not observed in non-CpG-island promoters. In type-change promoter pairs, a large number of CpG-to-TpG/CpA transitions were found. The dinucleotide changes were observed 2.8 and 3.5 times more frequently in humans and macaque, respectively, than those in CpG-island promoters. The frequencies were also higher than those in background genomic sequences. Thus, it is likely that deamination mutations contributed to promoter variation.
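A per-dinucleotide conservation rate of the kind reported above can be computed directly from a pairwise alignment. The following Python sketch is an illustration under stated assumptions rather than the procedure actually used: it takes two gapped, equal-length alignment rows, considers only dinucleotides formed by adjacent alignment columns without a gap in the reference row, and counts a dinucleotide as conserved only if the counterpart shows the identical two bases in the same columns.

```python
from collections import defaultdict


def dinucleotide_conservation(aligned_ref: str, aligned_other: str) -> dict:
    """Return {dinucleotide: (total, conserved)} for the reference row.

    A gap ("-") in the counterpart within the two columns counts as an
    alteration; reference dinucleotides interrupted by a gap are skipped.
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("alignment rows must have equal length")
    ref, other = aligned_ref.upper(), aligned_other.upper()
    stats = defaultdict(lambda: [0, 0])
    for i in range(len(ref) - 1):
        dinucleotide = ref[i:i + 2]
        if "-" in dinucleotide:
            continue
        stats[dinucleotide][0] += 1
        if other[i:i + 2] == dinucleotide:
            stats[dinucleotide][1] += 1
    return {key: tuple(value) for key, value in stats.items()}
```

Swapping the two arguments gives the rates from the counterpart's point of view, which need not be identical because the two rows may contain different numbers of each dinucleotide.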
Characterization of Inserted and Deleted Sequences in Promoters
While most of the human–macaque promoter pairs consisted of the same type in the two species, i.e., CpG-island or non-CpG-island type, 417 pairs (158 pairs + 259 pairs) showed discordance (Table 1). In addition to nucleotide substitutions, many insertions and deletions were detected in their sequence alignments, suggesting that such events could have evolutionarily altered the promoter types, as shown for the TBCE locus. Additionally, multiple sequence alignments of the PNP and CEND1 promoters are provided as examples of insertion by repeat expansion (Fig. 5). From all of the indels, we selected 439 sites whose neighboring 10-bp regions on both sides showed 90% or greater similarity between the human and macaque genomic sequences. Approximately 10% were indels of 10 bp or longer (Fig. 7). Alterations in the numbers of CpG dinucleotides, including those in the nearby regions, were observed at 82 indel sites. Among these, only 16 sites showed two or more CpG-site alterations, including six sites harboring an indel of 20 bp or longer. In CpG-island promoters, long indels often contained several CpG sites. If an inserted or deleted fragment contained CpG sites, the event could contribute to the gain or loss of CpG-island-like sequences. For all the detected indel sequences longer than 4 nt, Harr plots were drawn along with neighboring sequences (Fig. 5, Supplementary Table S1, Supplementary Fig. S2). Of the 88 indels, 77 were associated with tandem repeats, including 16 indels with inverted repeats (Supplementary Fig. S2). In total, repetitive sequences were found at 365 of the 439 indel sites, including shorter indels (Supplementary Table S3). More than 50% of the indels were single-nucleotide insertions or deletions (Fig. 7). In many cases, they could be considered an extension or shortening of homopolymeric repeats.
Inserted or deleted single nucleotides clearly depended on promoter types; while nucleotide bias was not observed in non-CpG-island promoters, frequent C or G indels were identified in CpG-island promoters.
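The selection of indel sites with well-conserved flanks, described earlier in this section, can be sketched in Python as follows. The sketch assumes a pairwise alignment given as two gapped rows of equal length, measures each flank as 10 alignment columns, and treats any gap within a flank as a mismatch; these conventions and the function names are assumptions, since the text specifies only the 10-bp flanks and the 90% threshold.

```python
import re


def flank_identity(x: str, y: str) -> float:
    """Fraction of columns in which both rows carry the same base."""
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return matches / len(x)


def find_indels(aligned_a: str, aligned_b: str,
                flank: int = 10, min_identity: float = 0.9) -> list:
    """List indels whose flanks on both sides meet the identity threshold.

    Each indel is a dict giving the row that carries the gap, the start
    column, the length, and the inserted/deleted sequence from the other row.
    Indels lacking a complete flank on either side are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("alignment rows must have equal length")
    a, b = aligned_a.upper(), aligned_b.upper()
    indels = []
    for name, gapped, other in (("a", a, b), ("b", b, a)):
        for match in re.finditer(r"-+", gapped):
            start, end = match.start(), match.end()
            if start < flank or end + flank > len(a):
                continue
            left = flank_identity(a[start - flank:start], b[start - flank:start])
            right = flank_identity(a[end:end + flank], b[end:end + flank])
            if left >= min_identity and right >= min_identity:
                indels.append({"gap_in": name, "start": start,
                               "length": end - start,
                               "sequence": other[start:end]})
    return sorted(indels, key=lambda indel: indel["start"])
```

With 10-column flanks, the 90% threshold tolerates at most one mismatch per side; counting occurrences of "CG" in the returned `sequence` would then indicate how many CpG sites an indel carries.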