Introduction

Genome evolution is a major driver of biological diversity. The mechanisms of these changes in both coding and non-coding sequences, and their impact on species evolution, have been extensively studied [1,2,3,4]. One of these processes is the emergence of novel genes, which can occur de novo from non-coding DNA, as a result of sequence rearrangements producing genes with new functionality, or by duplication of existing genes. The importance of gene duplication has been emphasized since the publications by Nei [1] and Ohno [2]. Furthermore, de Koning et al. indicated that approximately 70% of the human genome consists of repetitive sequences, the vast majority of which are transposable elements [5], underlining the importance of studying these once underestimated components.

Gene duplication can occur by retrotransposition, a process of reverse transcription of messenger RNA (mRNA) and the subsequent integration of the resulting complementary DNA (cDNA) into the genome. The proteins required for this process are provided by several retrotransposable elements, e.g., long interspersed nuclear elements 1 (LINE1) [6, 7]. These proteins bind to the mRNA, forming a complex that is transported back to the nucleus, where it anneals to double-strand breaks, undergoes reverse transcription, and is incorporated into the genome [3, 4, 8]. The resulting replicas (retrocopies) are characterized by the presence of poly(A) tracts, the absence of introns and regulatory components, and the repetitive sequences flanking the inserted sequence [9].

Retrocopies, which generally lack promoters, are regarded as ‘dead on arrival’, i.e., non-functional copies of their parents [10]. To become functional, retrocopies have to be expressed and therefore have to acquire regulatory elements. One way to achieve this is to ‘hitchhike’ on the regulatory elements of other genes [11]. Indeed, many retrocopies are found near or within other transcribed genes [8]. A retrocopy may also be inserted downstream of pre-promoters that have evolved into functional elements over time, or it can acquire a distant promoter by gaining a new 5’ exon from the vicinity of the insertion site [11, 12]. In some cases, a retrogene may also obtain a promoter from its progenitor if the parental gene is transcribed from a site upstream of the canonical transcription start site [13].

Many of these retroposed and transcriptionally active copies evolve neutrally because they do not encode proteins. Others, with an intact coding sequence, may encode a protein that is beneficial to the organism and fulfil functions comparable to those of the parental genes (subfunctionalization) [2, 14, 15]. However, the acquisition of a new role through evolution (neofunctionalization) is also very common [4, 16, 17]. As some studies have shown, retrocopies can also occasionally functionally replace their progenitors [18, 19].

Similarly to other retroelements, such as retrotransposons and retroviruses, retrocopies provide genetic material that may bring an adaptive benefit and contribute to intra- and interspecies differences [20,21,22]. On the other hand, retroposition also poses a threat to genome integrity. An inserted retroelement may disrupt an exonic sequence, interfere with splicing, or affect the transcriptional machinery [23]. In addition, not only transposable elements but also viral genetic material can be incorporated into the DNA of cells. Therefore, controlling this constant threat of invasion by RNA-derived elements is fundamental to genome integrity. The defense strategies that have evolved are usually based on chromatin silencing factors, such as small RNAs that bind to their targets or sequence-specific DNA-binding proteins [24]. The germline and pluripotent stem cells are primarily protected by PIWI-interacting RNAs (piRNAs) [25] and KRAB-containing zinc-finger proteins (KRAB-ZNFs) [26], whereas in differentiated cells, the human silencing hub (HUSH) complex is the most active one [27, 28]. HUSH is composed of transgene activation suppressor (TASOR), M-phase phosphoprotein 8 (MPP8), and Periphilin (PPHLN1, isoform 2) and is able to silence LINE1s as well as retroviruses through chromatin modification and histone H3 lysine 9 trimethylation (H3K9me3) [29,30,31]. Seczynska et al. found that sequences repressed by the HUSH complex can often be characterized as long, intronless, transcriptionally active transposable elements with a high level of adenine on the sense strand [27]. This critical genome defence strategy and the ability of HUSH to target retroposed cellular mRNAs could have a significant impact on the evolution and expression of functional retrocopies. Retroposition of cellular mRNA is a primary mechanism of new gene formation, and therefore, HUSH-mediated repression may play a key role in the functional evolution of this new genetic material.
The aim of this study is to examine the evolution of different classes of protein-coding genes’ retrocopies in the context of the HUSH regulation.

Materials and methods

Data source

Analyses were performed based on the human and pig sets of retrocopies deposited in RetrogeneDB2, a database of retrocopy annotations in eukaryotic genomes developed in our laboratory [32]. Parental gene sequences were downloaded from the Ensembl database (release 105) [33]. Ensembl annotations were also used to identify protein-coding retrogenes.

Retrocopies deposited in RetrogeneDB were identified based on similarities between the reference genomic sequence and proteins encoded by multiexon genes. Several criteria were applied to filter the results and increase accuracy. It was required that at least two introns were lost and the alignment had at least 150 bp, at least 50% identity and covered at least 50% of the parental protein [32].
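
The filtering criteria listed above can be illustrated with a minimal sketch. The `Alignment` record, its field names, and the example values are hypothetical and are not taken from the RetrogeneDB2 pipeline; only the four thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one parental-protein-to-genome alignment."""
    length_bp: int          # alignment length in base pairs
    identity: float         # fraction of identical positions (0-1)
    parent_coverage: float  # fraction of the parental protein covered (0-1)
    introns_lost: int       # parental introns absent from the candidate


def passes_filters(aln, min_length=150, min_identity=0.5,
                   min_coverage=0.5, min_introns_lost=2):
    """Apply the four thresholds stated in the text; all must hold."""
    return (aln.length_bp >= min_length
            and aln.identity >= min_identity
            and aln.parent_coverage >= min_coverage
            and aln.introns_lost >= min_introns_lost)


# Made-up candidates for illustration only.
candidates = [
    Alignment(900, 0.92, 0.98, 5),  # passes all criteria
    Alignment(120, 0.95, 0.60, 3),  # alignment too short
    Alignment(600, 0.45, 0.80, 4),  # identity too low
    Alignment(450, 0.80, 0.70, 1),  # only one intron lost
]
kept = [a for a in candidates if passes_filters(a)]
print(len(kept))
```

Requiring at least two lost introns is the most restrictive of these conditions for short parental genes, which is one reason the resulting retrocopy counts are conservative.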

RNA-seq data analysis

For human, we utilized gene expression estimates from previous studies performed in our laboratory based on 818 ENCODE RNA-seq libraries [16, 34]. From this set, 205 samples representing normal tissue were selected. Raw reads from 15 porcine RNA-seq experiments were downloaded from publicly available databases, such as SRA NCBI [35], ENA EBI [36], or ENCODE [37]. The selected RNA-seq datasets were required to consist of paired-end reads with a minimum length of 50 bp and to originate from normal tissues or organs. A list of the 205 human and 15 porcine libraries analyzed is shown in Table S1. The processing of RNA-seq reads was the same as previously for human data [16, 34]. First, reads went through quality control steps using FastQC [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/] followed by quality filtering, quality trimming, and adapter clipping utilizing BBDuk2 from the BBTools package (Joint Genome Institute; https://jgi.doe.gov). The following parameters were set for this step: qtrim = w, trimq = 20, maq = 10, rref = adapters.fa (a built-in set of Illumina adapters), k = 23, mink = 11, hdist = 1, tbo, tpe, minlength = 2/3 of raw read length, removeifeitherbad = t, which are thoroughly described on the tool’s website (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/). The reads originating from ribosomal RNA (rRNA) were filtered out by mapping against a set of human and porcine rRNA sequences obtained from Ensembl [33] and RefSeq [38]. This step was performed using Bowtie 2 [39]. To establish the type of each RNA-seq library and to ensure that only paired-end sets were analysed further, we used Bowtie and infer_experiment.py from the RSeQC package [40]. After downloading and preparing the porcine transcriptome from Ensembl (release 105), the expression levels for transcripts were estimated with Salmon v0.7.2 [41] using default parameters with the addition of the --seqBias and --gcBias options.
The TPM (transcripts per million) values obtained for all the transcripts assigned to each gene were then summed using a Python script and combined with the RetrogeneDB2 annotations. Retrogenes annotated as known protein-coding genes were considered to be expressed. Retrogenes annotated as pseudogenes had to meet the following criterion to be counted as expressed: expression level ≥ 1 TPM in at least three or two RNA-seq libraries for human and pig, respectively.
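
The gene-level summation and the expression criterion described above can be sketched as follows. This is an illustration only and not the script used in the study; the function names, the transcript-to-gene mapping, and the example values are assumptions.

```python
from collections import defaultdict

# Minimum number of libraries with TPM >= 1, as stated in the text.
MIN_LIBRARIES = {"human": 3, "pig": 2}


def gene_tpm(transcript_tpm, tx2gene):
    """Sum transcript-level TPM values into gene-level TPM for one library."""
    totals = defaultdict(float)
    for transcript, tpm in transcript_tpm.items():
        totals[tx2gene[transcript]] += tpm
    return dict(totals)


def is_expressed(tpm_per_library, species, min_tpm=1.0):
    """True if the gene reaches min_tpm in enough libraries for the species."""
    n_above = sum(1 for tpm in tpm_per_library if tpm >= min_tpm)
    return n_above >= MIN_LIBRARIES[species]


# Hypothetical identifiers and values.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
library = {"tx1": 0.4, "tx2": 0.8, "tx3": 2.0}
per_gene = gene_tpm(library, tx2gene)

# The same pseudogene-annotated retrocopy profile evaluated under both rules.
profile = [1.2, 0.3, 5.0, 0.0]
print(is_expressed(profile, "human"), is_expressed(profile, "pig"))
```

The lower library threshold for pig reflects the much smaller number of porcine libraries (15 versus 205).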

Analysis of protein-coding retrogenes origin

The group of human protein-coding retrogenes was analysed using the GenTree database to determine the time of their origin [42]. Retrocopies were assigned to the branches of the phylogenetic tree based on their Ensembl ID. All retrogenes that originated after the Simiiformes branch split were recognized as young and specific for primates.

Identification of orthologs of HUSH complex components

Porcine orthologues of all components of the human HUSH complex - TASOR, MPP8 (MPHOSPH8), and PPHLN1 - were identified based on the NCBI HomoloGene database (HomoloGene, https://www.ncbi.nlm.nih.gov/homologene). Orthology was also confirmed by reciprocal blastp search [43].

Calculation of nucleotide content

To calculate the adenine, thymine and GC content (A-content, T-content, GC-content) in the analyzed retrogenes and their parental gene sequences, the corresponding FASTA files for human and pig retrocopies were downloaded from the RetrogeneDB2 database [32] and, in the case of the parental genes, from the Ensembl database. To calculate the GC-content in the retrocopy flanking regions, 5000 nt downstream and upstream of the retrocopy site were extracted from the human genome (Ensembl release 105). Subsequently, the computation step was conducted using the SeqKit toolkit developed by W. Shen et al. [44].
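
For readers who want to reproduce the quantity being measured, the following sketch computes A-, T-, and GC-content and extracts flanking regions in plain Python. It does not reproduce the tool used in the study; in particular, excluding ambiguous bases from the denominator and using 0-based half-open coordinates are assumptions made here.

```python
def nucleotide_content(seq):
    """Return A-, T- and GC-content as percentages of unambiguous bases."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())  # N and other ambiguity codes are ignored
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {
        "A": 100.0 * counts["A"] / total,
        "T": 100.0 * counts["T"] / total,
        "GC": 100.0 * (counts["G"] + counts["C"]) / total,
    }


def flanks(chrom_seq, start, end, size=5000):
    """Return (upstream, downstream) flanks of a 0-based half-open interval."""
    upstream = chrom_seq[max(0, start - size):start]
    downstream = chrom_seq[end:end + size]
    return upstream, downstream


content = nucleotide_content("AATTGGCCNN")
up, down = flanks("A" * 10 + "GGGG" + "T" * 10, 10, 14, size=5)
print(content, up, down)
```

Because the sense strand is analysed, the T-content computed this way corresponds to the A-content of the opposite strand.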

Substitution analysis

To evaluate the substitution rate at different codon positions, the sequences of a retrocopy and its parental gene were first aligned with tblastn to generate alignments at the amino acid level. When the similarity between the retrogene and the parental gene was relatively low, the retrogene nucleotide sequence was aligned to the parental protein using blastx. Next, the cognate coding sequences were aligned, guided by the amino acid alignment. This ensured that codons were aligned properly. Finally, the number of substitutions at each codon position was counted and the substitution rate was calculated. This was done using a custom Perl script. The type of substitution was examined using an in-house Python script.
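
The final counting step can be sketched as follows, assuming the two coding sequences are already codon-aligned (equal length, gaps inserted as whole codons). This is a simplified illustration, not the custom script used in the study, and the example sequences are invented.

```python
def codon_position_rates(parent_cds, retro_cds):
    """Return substitution rates at codon positions 1, 2 and 3.

    Both inputs must be codon-aligned strings of equal length; codons
    containing a gap character ('-') in either sequence are skipped.
    """
    if len(parent_cds) != len(retro_cds) or len(parent_cds) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")
    differences = [0, 0, 0]
    sites = [0, 0, 0]
    for i in range(0, len(parent_cds), 3):
        p_codon = parent_cds[i:i + 3].upper()
        r_codon = retro_cds[i:i + 3].upper()
        if "-" in p_codon or "-" in r_codon:
            continue
        for pos in range(3):
            sites[pos] += 1
            if p_codon[pos] != r_codon[pos]:
                differences[pos] += 1
    return [d / s if s else 0.0 for d, s in zip(differences, sites)]


# Codons: ATG/ATG, GCT/GCC, AAA/AAG, ---/CCC (skipped), TTT/TAT
rates = codon_position_rates("ATGGCTAAA---TTT", "ATGGCCAAGCCCTAT")
print(rates)
```

A rate that is clearly highest at the third position is the pattern expected under purifying selection, because most third-position changes are synonymous.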

Statistical analysis

Adenine and thymine content and expression values were examined using GraphPad Prism 5 (GraphPad, San Diego, CA, USA, www.graphpad.com) and R [45] with the “ggsignif” [46], “ggplot2” [47], “ggh4x” [48], and “smplot2” [49] packages. First, the Shapiro-Wilk normality test was used to determine whether the data had a normal distribution. The Kruskal-Wallis test with Dunn’s multiple comparison test was then used to compare differences between adenine content and expression levels in protein-coding, expressed, and non-expressed retrocopies, as well as parental genes, in both species studied. The relationship between retrocopy adenine content and expression level was studied using the Spearman correlation test. Finally, to determine how adenine content and expression levels vary between retrocopies and their parental genes, the Mann-Whitney U test or the unpaired t-test with Welch correction was conducted. In all analyses, p < 0.05 was considered statistically significant.
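
As an illustration of the rank-based correlation used here, the sketch below computes the Spearman coefficient with average ranks for ties. The analyses in this study were run in GraphPad Prism and R; this standard-library version is only meant to make the statistic concrete, it does not compute a p-value, and the example values are invented.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical retrocopy lengths (bp) and expression values (TPM).
lengths = [500, 800, 1200, 2000, 3500]
tpm = [12.0, 9.5, 4.0, 4.5, 1.1]
rho = spearman_rho(lengths, tpm)
print(round(rho, 3))
```

Because only ranks are used, the coefficient is unchanged by the log2 transformation applied to expression values for plotting.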

Visualisation

The graphs were prepared using R [45] with the four packages mentioned above [46,47,48,49], as well as the “tidyverse” package [50]. For transparency and to improve the quality of the graphs, approximately 0.03% of the outliers with the highest values were removed from each of the expression datasets. The operation did not affect the significance of the presented data. The phylogenetic tree was made using an online graphic design tool, Canva.

Results

Expression of retrocopies and parental genes

A total of 4611 human retrocopies were downloaded from RetrogeneDB2 [32]. Retrocopies recently retired from the Ensembl database [33] were excluded from the analyses, resulting in a final set of 4463 retrocopies. Nevertheless, the number of retrogenes is likely to be underestimated due to the stringent requirements that were applied to the retrogene identification process in RetrogeneDB2. These retrocopies originated from 1503 parental genes. As many as 1340 retrocopies originated from RPL and RPS ribosomal protein genes, which is not surprising [51]. The genes with the highest number of retrocopies include: RPL21 (108), PPIA (88), RPL23A (68), KRT18 (67), HNRNPA1 (66), RPL7 (55), HMGN2 (55), RPS2 (48), RPL31 (46), and RPL12 (43).

The expression level of retrogenes and their progenitors was estimated based on publicly available RNA-seq data. Based on the annotation and expression data, retrocopies were divided into three categories: protein-coding retrogenes and non-coding retrocopies, the latter further divided into expressed and non-expressed subgroups. Throughout the manuscript, these groups are referred to as protein-coding, expressed and non-expressed. The first group (protein-coding) was the least numerous, accounting for only 2.38% of all retrocopies. Expressed and non-expressed retrocopies accounted for 42.93% and 54.69% respectively. Taken together, protein-coding and expressed retrocopies account for 45.31% of human retrocopies, i.e. nearly half demonstrated transcriptional activity.

Similarly to the retrocopies, their progenitors were also divided into three groups according to the category of retrocopy they produced. Some genes were placed in two or all three groups because they produce multiple retrocopies with different statuses. Comparison of expression levels showed that retrocopies have on average lower expression than their progenitors, and that protein-coding retrogenes have significantly higher expression than the subgroup of expressed retrocopies. However, the three groups of parental genes did not differ (Fig. 1).

Fig. 1

The expression of retrocopies and their progenitors. Values were transformed to log2 for visualization purposes. **** p ≤ 0.0001, ns – not significant

The low level of expression of the retrogenes may indicate that they are under HUSH control similar to other retroposed sequences. Parental genes have multiple introns that protect them from the repressive HUSH effect. However, retrocopies have a much simpler structure than their cognate genes, in most cases only a relatively long single exon. To investigate this, we checked if there was a correlation between the length of the retrocopy genomic sequence and expression. In the group of protein-coding retrogenes, there was no correlation (Fig. 2A). However, in the case of expressed retrocopies, a significant negative correlation was found (Fig. 2A). This is in agreement with the work of Seczynska et al. [27], which showed that longer but intronless sequences are more susceptible to HUSH silencing. To clarify the lack of correlation in the group of protein-coding retrogenes, the expression of single-exon and multiexon molecules was then compared and significant differences were found. Single exon protein-coding retrogenes had significantly lower expression levels than those containing at least one intron (Fig. 2B), confirming the protective role of introns.

TPM values do not take into account differences in sequencing depth. To ensure that our results are not biased by this issue, we repeated some analyses at the individual sample level. We compared protein-coding retrogenes with expressed retrocopies in all individual libraries. In each sample, protein-coding retrogenes had higher expression level (not shown). Similarly, we calculated the correlation coefficient between expression and the length of the expressed retrocopies. The correlation was always negative and in the vast majority statistically significant (not shown).

Fig. 2

Retrocopy genomic sequence length and expression level. (A) Expression correlation for protein-coding retrogenes and expressed retrocopies. (B) Comparison of the mean expression level between single- and multiexon protein-coding retrocopies. Values were transformed to log2 for visualization purposes. * p ≤ 0.05

Retrocopies sequence composition and expression

It was noted that the adenine content (A-content) of the gene sense strand was positively correlated with the silencing by HUSH [27]. Therefore, the A-content was calculated in all groups of retrocopies and parental genes. It was determined that protein-coding retrogenes have the lowest level of adenine compared to the remaining two types, expressed and non-expressed. The mean adenine content was 26.98%, 30.33%, and 29.75%, respectively, and the differences observed between all groups were statistically significant (Fig. 3A). In accordance with the expression level, there was no variation in the average A-content between the three groups of parental genes (not shown). Interestingly, protein-coding retrogenes did not differ from their progenitors in this respect, but in the case of other two categories, the parental genes had a significantly lower fraction of adenine than their retrocopies (Fig. 3B).

Fig. 3

Adenine content (A-content) and thymine content (T-content) in human retrocopies. (A) Comparison of the A-content in three groups of retrocopies, (B) between retrocopies and their progenitors, (C) comparison of the T-content in three groups of retrocopies and (D) between retrocopies and their progenitors, **** p ≤ 0.0001, ns – not significant

In the context of HUSH silencing, these results appear to be in concordance with the expression level analysis. Protein-coding retrogenes have not acquired as many adenines as other retrocopies and may therefore be less susceptible to the influence of HUSH, and consequently achieve a higher level of expression, although still not as high as that of their precursors. However, if the high A-content on the sense strand were selected for by HUSH silencing, no differences should be observed on the opposite DNA strand. To check this, we also analyzed the T-content on the sense strand, which reflects the A-content on the other strand. Interestingly, there are no differences in the T-content on the sense strand between coding and expressed retrocopies and their parental genes. The non-expressed retrocopies have a higher amount of T than their progenitors, although the difference is less significant than in the case of A-content. However, there are no significant differences between retrocopy categories (Fig. 3C-D). Consistent with the changes in A-content, the GC-content of non-coding retrocopies is significantly lower compared to their progenitors (Fig. 4A). Despite this, all retrocopies have a higher GC-content than their surroundings, regardless of whether they are in an intergenic region or in the intron of another gene. The latter is quite common in retrocopies (Fig. 4B) [8]. This is because they inherit GC-content from their progenitors. Protein-coding sequences are known to have a high GC-content compared to introns and intergenic sequences [52]. In addition, retrocopies tend to arise from genes with even higher GC-content, especially at their 5’ ends [8].

Fig. 4

Comparison of the GC-content (A) Between retrocopies and their progenitors and (B) Between retrocopies and their surroundings, **** p ≤ 0.0001, ns – not significant

To further clarify the phenomenon of high A-content in non-protein-coding retrocopies, we calculated the Spearman correlation coefficient between A-content and retrocopy expression and found no correlation in either group - protein-coding retrogenes and expressed retrocopies (not shown). Nevertheless, it is plausible that there is no direct correlation, and that some threshold level of adenine must be reached to render a retrocopy susceptible to HUSH silencing. Therefore, to investigate the relationships between A-content, expression, transcript length, and gene structure, we divided the expressed retrocopies into six groups. The classification was made according to the genomic structure (single or multi-exon) and content of adenine (low – below the 25th percentile, medium – between the 25th and 75th percentile, and high – above the 75th percentile of A-content values) (Fig. 5). In each group, we calculated the correlation between the length of the transcript sequence and the expression. Interestingly, in the group of single exon retrocopies, there was a negative correlation between gene expression and transcript length regardless of adenine content (Fig. 5A-C). However, retrocopies with introns showed no significant correlation between expression and transcript length in any of the groups analyzed (Fig. 5D-F). Based on these results, it could be concluded that the exon length may be the factor that makes retrocopies susceptible to HUSH silencing and the presence of the intron may have some protective effect. However, they do not confirm a direct relationship between A-content and the expression level and suggest that the A-content, although significantly higher than in parental genes, is not an important factor.
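
The six-group classification described above can be sketched as follows. The record fields and example values are hypothetical, and the quartile method (`method="inclusive"`) is an assumption, since the text does not state how the percentiles were computed.

```python
from statistics import quantiles


def a_content_class(a_content, q25, q75):
    """Classify A-content as low (< 25th), high (> 75th) or medium."""
    if a_content < q25:
        return "low"
    if a_content > q75:
        return "high"
    return "medium"


def group_retrocopies(retrocopies):
    """Group retrocopy ids by (gene structure, A-content class)."""
    q25, _, q75 = quantiles([r["a_content"] for r in retrocopies],
                            n=4, method="inclusive")
    groups = {}
    for r in retrocopies:
        structure = "single-exon" if r["n_exons"] == 1 else "multi-exon"
        key = (structure, a_content_class(r["a_content"], q25, q75))
        groups.setdefault(key, []).append(r["id"])
    return groups


# Invented expressed retrocopies: A-content in percent, exon counts.
a_values = [24, 26, 28, 30, 32, 34, 36, 38, 40]
exon_counts = [1, 2, 1, 1, 2, 1, 2, 1, 2]
retrocopies = [
    {"id": f"r{i + 1}", "a_content": a, "n_exons": n}
    for i, (a, n) in enumerate(zip(a_values, exon_counts))
]
groups = group_retrocopies(retrocopies)
for key in sorted(groups):
    print(key, groups[key])
```

Within each of the resulting groups, the length-expression correlation would then be computed separately, which is what allows A-content and gene structure to be examined independently.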

Fig. 5

Transcript length and expression correlation in six groups of expressed retrocopies: single exon with (A) low (below 25th percentile), (B) medium (between 25th and 75th percentile), and (C) high (above 75th percentile) A-content and multi-exon with (D) low, (E) medium, and (F) high A-content. The expression values were transformed to log2 for visualization purposes

Substitution pattern

Retrocopies are known to be ‘dead on arrival’, i.e. they are transcriptionally inactive after retroposition due to the lack of regulatory elements. They are therefore not under evolutionary pressure and accumulate mutations. The elevated levels of adenine in both groups of non-coding retrocopies (expressed and non-expressed) may indicate that these duplicates have evolved freely without any evolutionary pressure. To determine which substitutions contributed the most to adenine accumulation and to identify differences between protein-coding retrogenes and other retrocopies, all types of nucleotide changes were counted. In protein-coding retrogenes, T > C; A > G substitutions are the most common, followed by G > A; C > T changes (Fig. 6). Interestingly, it is opposite in both expressed and non-expressed retrocopies, the dominant substitutions are G > A; C > T and they are followed by T > C; A > G changes (Fig. 6). This result is similar to other nucleotide substitution studies in pseudogenes and the pattern was found to be the same regardless of the background GC composition [53].
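
Counting substitution directions, as reported above, can be sketched as follows. Two assumptions are made here and are not stated in the text: the parental base is treated as the original state, and each substitution is pooled with its strand-complementary equivalent (for example, G > A together with C > T), which mirrors the paired notation used in this section. The example sequences are invented.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pooled_class(src, dst):
    """Label a substitution together with its complementary equivalent."""
    direct = f"{src}>{dst}"
    complementary = f"{COMPLEMENT[src]}>{COMPLEMENT[dst]}"
    return "; ".join(sorted((direct, complementary)))


def substitution_spectrum(parent, retro):
    """Count pooled substitution classes between two aligned sequences.

    Positions with gaps or ambiguous bases are ignored.
    """
    counts = Counter()
    for p, r in zip(parent.upper(), retro.upper()):
        if p in COMPLEMENT and r in COMPLEMENT and p != r:
            counts[pooled_class(p, r)] += 1
    return counts


# Differences: C>T at position 2, G>A at position 5, A>G at position 7.
spectrum = substitution_spectrum("ACGTGCA", "ATGTACG")
print(dict(spectrum))
```

Pooling yields six classes in total: two transition classes and four transversion classes.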

Fig. 6

Directions of nucleotide substitutions in all analyzed groups of retrocopies

We also checked the frequency of substitutions at different codon positions in the protein-coding retrocopies. The total length of the aligned amino acid sequences was 82,998 amino acids (excluding gaps), or 248,994 nucleotides respectively. As expected, the highest substitution rate (0.034) was at the third codon position, followed by first codon position (0.016) and the second one (0.013). This is a typical behavior of genes evolving under negative selection and consequently implying functionality of the analyzed genes [54, 55].

Retrogenes in the pig genome

The HUSH complex is conserved from fish to mammals, so we investigated whether similar observations could be made in another mammalian species. Although there is a wealth of data available for the mouse, we deliberately chose the pig for comparison as it is a more distant species. A total of 1026 retrocopies were downloaded from RetrogeneDB2. This is a considerably lower number than for humans, partly due to the burst of retroposition in primates [56] and partly due to gaps in the annotation of the pig genome. Nevertheless, the number is quite similar to other estimates [57]. The fractions of protein-coding and expressed retrocopies are higher than in human, reaching 6.77% and 54.45% respectively (Fig. 7A). Two major factors contributed to these differences. First, the burst of retroposition in primates resulted in a large number of young and inactive retrocopies. Second, protein-coding retrogenes are old and mainly shared between mammals. Therefore, they make up a larger proportion of a smaller set of retrogenes. The analysed retrocopies originated from 508 parental genes. Ribosomal protein genes yielded 278 retrocopies; however, the gene with the highest number of retrocopies is FTL (33). It is followed by RPL17 (30), RPLP1 (16), RPL9 (15), SUMO2 (15), RPL11 (14), RPS25 (12), RPS20 (12), FTH1 (9), and RPL32 (9).

RNA-seq data analysis revealed that the expression level of protein-coding retrogenes is elevated compared to the group of expressed retrocopies (Fig. 7B). These results are consistent with previous conclusions based on human data. In addition, as in humans, the expression of parental genes does not vary between genes that produce different types of retrocopies (Fig. 7B). We then examined the expression pattern of single- and multi-exon protein-coding retrogenes. In pigs, as in humans, more complex retrogenes were more abundantly expressed, although the differences in the mean expression levels were not statistically significant (Fig. 7C).

Adenine content analysis also gave results consistent with the analysis of human retrocopies. The A-content is significantly lower in protein-coding retrogenes and there is no difference between the two other groups of retrocopies. Also, while protein-coding retrogenes do not differ from their progenitors in the average adenine content, there is a significant difference in the case of the remaining two groups of retrocopies (Fig. 7D-E). The substitution patterns resemble those observed in humans (Fig. 7F).

Fig. 7

Analyses of pig retrocopies corresponding to previous calculations in humans. (A) The percentage of studied retrocopy subtypes in human and pig, (B) The expression of pig retrocopies and their progenitors, (C) Comparison of the mean expression level between single- and multiexon protein-coding retrocopies, (D) Comparison of the A-content in three groups of retrocopies and (E) between retrocopies and their progenitors, (F) Percentage of individual substitutions in all nucleotide changes. Values were transformed to log2 for visualization purposes. **** p ≤ 0.0001, *** p ≤ 0.001, ** p ≤ 0.01, ns – not significant

Discussion

The high level of retrotransposition, accompanied by complex mechanisms of the development of new functions, confirms the impact of RNA processing and RNA-directed rewriting of DNA on the evolution and phenotypic diversity of organisms. Retrocopies have been shown to significantly influence the diversification of transcriptomes and proteomes, earning them the title of ‘seeds of evolution’ [58, 59]. Studies of young retrogenes have shown that these sequences played a substantial role in, e.g., evolution of brain in primates [60] and Drosophila melanogaster [61]. Also, these new additions developed unique spatial expression patterns compared to the parental genes, and molecules derived from these retrogenes gained novel biochemical properties [60, 62, 63], and/or different subcellular localization patterns [60, 62]. This subcellular adaptation or relocalization process represents a new evolutionary pathway for the development of new gene functions [8, 64].

Retrogenes play a crucial role in genome evolution by providing novel genetic material, but they also pose a threat to genome integrity. As products of reverse transcription, they can be recognized as genomic ‘parasites’ and are therefore susceptible to repression by the HUSH complex, as determined by Seczynska et al. [27]. The researchers showed that the HUSH complex represses the products of reverse transcription inserted into the genome. They also showed that HUSH targets long, intronless, and transcriptionally active sequences in which the sense strand is rich in adenine [27]. The components of the HUSH complex, TASOR, MPP8, and periphilin, regulate the expression of retroposed sequences in an H3K9me3-dependent manner, and transcription is required for H3K9me3 initiation and propagation. Targets are localized by periphilin, which binds to RNA and enables HUSH to respond to increased transcription. Consequently, an increased amount of target RNA leads to further periphilin binding and intensified HUSH occupancy. This in turn recruits more SETDB1, a histone methyltransferase, and MORC2, an ATP-dependent chromatin remodeler that compacts chromatin [29, 65].

The HUSH complex recognizes evolutionarily young retroelements and provides an immediate defense mechanism against these genomic ‘invaders’. However, this evolutionary ‘war’ is fought on both sides: host and parasite. Over time, transposable elements have therefore evolved their own defense mechanisms, making them at least partially resistant to the influence of HUSH. Human immunodeficiency viruses type 1 and 2 (HIV-1 and HIV-2), for example, use their viral auxiliary proteins to counteract HUSH restrictions. The viral proteins Vpx and Vpr antagonize SAMHD1, a factor that inhibits reverse transcription. These molecules bridge the DCAF1 ubiquitin ligase substrate adaptor to SAMHD1 for subsequent ubiquitination and degradation [66, 67]. It appears that Vpx and Vpr counteract HUSH repression by a similar mechanism - an induction of its proteasomal degradation through the recruitment of DCAF1 [68, 69].

In the present study, we analyzed retrocopies in the context of HUSH complex repression. Our analyses of retrocopy expression levels confirmed previous findings that most of these molecules, including protein-coding retrogenes, have low expression levels [70]. In addition, we showed that their expression is significantly reduced compared to their cognate genes. Retrocopies contain very long exons resulting from the mechanism of their origin. These observations support the studies of Seczynska et al. [27] and indicate that the low expression of retroposed genes may result from HUSH repression. However, our results suggest that retrocopies have found a way to ‘escape’ HUSH silencing. This is mainly due to the evolutionary fate of retroposed genes. Initially, most retrocopies are deprived of regulatory elements and are considered to be ‘dead on arrival’. To become transcriptionally active and thus targeted by HUSH, retrocopies need to acquire promoters. Published studies show that the vast majority of these retrocopies (86%) acquired a promoter de novo from a cryptic intergenic promoter [70]. Promoter acquisition is in many cases associated with the gain of a new 5’ exon, and it has been shown that many transcriptionally active retrocopies gained 5’ exons from upstream sequences. This implies the acquisition of introns, which are often very long. Exons can be acquired quite rapidly, and about 20% of young human retrogenes have non-parental 5’ exons. According to Seczynska et al., introns, especially long ones, protect against HUSH repression [27]. Therefore, as more complex structures are obtained, retrocopies also gain at least some immunity to HUSH.

This complex has also been shown to target sequences with a substantial amount of adenine in the DNA sense strand [27], which is consistent with the context of retroelement evolution and DNA methylation. Deoxycytosine methylation occurs at the cytosine of the CpG dinucleotide, producing 5-methylcytosine (5mC), which mutates to thymine by spontaneous deamination [71]. As a result, CpG decay and an increase in TpG and CpA dinucleotide frequency are observed. It is known that in primate genomes, for example, more than 40% of CpG islands are found within repetitive elements [72]. Accumulation of adenine has been observed as a result of methylation in Alu retroelements [73]. This supports the finding that HUSH defends the genome against DNA invasion and targets sequences with high adenine content. However, our results show that this may not be true for retroposed genes. Protein-coding retrogenes have, on average, lower expression than their progenitors, but do not differ in adenine content. We also found no correlation between the adenine content and the expression level. Therefore, other factors, such as the presence of long exons, seem to be more important. The expression of single-exon retrocopies decreases with increasing exon length, independent of the adenine content. The presence of a long exon does not seem to have such a negative effect on expression when the retrocopy, whether protein-coding or non-coding, has acquired an intron.

Protein-coding retrogenes had a significantly lower adenine content than the other two categories of retrocopies. It has previously been shown that exonic sequences contain more CpG sites than intergenic [52] and intronic [74] DNA, making them more susceptible to mutation. It has also been shown that CpG-containing codons are subject to stronger purifying selection than less mutable sites at identical codon positions [75, 76]. In addition, high GC-content promotes nuclear export of mRNAs, especially intron-poor mRNAs, and is important in distinguishing functional RNAs from junk transcripts [77]. These GC-rich regions likely recruit protein factors such as the THO complex, SR proteins and RBM33, which in turn recruit nuclear transport receptors [78]. These observations highlight the differences between protein-coding retrogenes and the remaining two categories of retrocopies: expressed but non-coding, and non-expressed. Protein-coding retrogenes probably gained promoters soon after retrotransposition, before mutations could erode their coding potential; this immediately placed them under selective pressure and preserved their CpG-containing codons. As a result, the adenine content, and therefore also the GC-content, of protein-coding retrogenes does not differ from that of the parental genes. In contrast, in retrocopies that were not transcriptionally active for long periods of time, or that did not acquire promoters at all, both nucleotides at the CpG site were free to undergo neutral substitution. In the absence of negative selection, TpG and CpA dinucleotides accumulated as a consequence of cytosine methylation and subsequent mutation to thymine. Our results corroborate those of Subramanian and Kumar, who demonstrated the decay of CpG over time in pseudogenes [52]. Non-coding retrocopies inherited a high CpG content from their protein-coding parents, and since they no longer code for proteins, these highly mutable sites have escaped selective pressure. Thus, even in a relatively short time, they could accumulate enough adenine to differ from their parents.

The results of Seczynska et al. [27] suggest that sequences with high adenine content are more susceptible to HUSH silencing. However, our study showed no correlation between adenine content and expression level in any group of retrocopies. On the other hand, these retrocopies show a significant decrease in GC-content, and since high GC-content has been found to correlate with the nuclear export of transcripts from intron-poor genes [77], this may be a more important factor than adenine content in the low expression of retrocopies. In addition, we cannot exclude other factors, such as promoter architecture. For example, it has been shown that the promoters of retrocopies are depleted of CpG islands and are bound by fewer transcription factors than those of the original genes [79].

Conclusions

In summary, our results show that the presence of long exons has a negative effect on retrocopy expression. We have also shown that intron gain provides some protection against possible HUSH repression and makes the expression level less dependent on transcript length. Together, these findings suggest that retrocopies may be under some control of the HUSH complex. However, we cannot exclude the contribution of other factors, such as GC-content and/or promoter architecture.