INTRODUCTION

At the dawn of molecular biology, when little was known about the underlying molecular nature of biological phenomena, numerous theoretical papers attempted to foresee future discoveries and to make viable predictions regarding molecular explanations of fundamental genetic concepts. Notably, only a relatively small fraction of such papers withstood the test of time and the eventual experimental scrutiny that followed in the years to come. Among such visionary papers, the theoretical prediction by Alexey Olovnikov of terminal DNA under-replication in linear chromosomes and of the specialized enzyme that could overcome this problem [1, 2] occupies a well-deserved place. While simultaneous recognition of the end-replication problem is also credited to the paper by James Watson [3], its focus on phage DNA avoided the requirement for a specialized polymerase, shifting the emphasis to end-processing nucleases instead.

The Nobel prize-winning discovery of telomerase, the specialized polymerase which can add simple repetitive sequences to the ends of linear chromosomes to compensate for terminal DNA loss after each round of replication, has in turn followed a long and winding path. In the initial report by Greider and Blackburn, the discovered Tetrahymena enzyme was designated as a terminal transferase [4], because the detected activity was adding tandem repeats onto telomeric primers without an apparent template. An associated RNA template, however, was subsequently identified as an integral component of the ribonucleoprotein holoenzyme, providing experimental evidence in support of RNA-dependent DNA synthesis [5], although it was still considered premature to classify the telomerase enzyme as an authentic reverse transcriptase.

The process of DNA synthesis that uses RNA as a template is universally recognized under the term “reverse transcription”, and the corresponding enzyme that can perform this reaction bears the name “reverse transcriptase” (RT). Its experimental discovery by Temin and Baltimore more than 50 years ago [6, 7], which was also recognized by a Nobel prize, was similarly preceded by Howard Temin’s conceptualization of DNA synthesis on a viral RNA template, known as “the provirus hypothesis” [8]. Little did they know that in addition to discovering the reverse flow of genetic information from viral RNA to DNA, they also provided the foundation for the discovery of self-replicating movable genetic elements and for the eventual realization that some of the accessory or even essential host functions can be taken over by the descendants of such mobile elements. Remarkably, RTs were discovered at approximately the time when the chromosome end under-replication problem first came to light (Fig. 1).

Fig. 1.

The main types of reverse transcriptases (RT) from the three domains of life. a) Chronology of RT discovery. The main RT types described in the text are colored as follows: viral RTs, shades of red; RTs of eukaryotic mobile elements, shades of green; prokaryotic RTs, shades of blue; domesticated eukaryotic RTs, shades of purple. Domesticated RTs are underlined. The years correspond to the first reports of identification of homology to the RT catalytic core. The year 1971 marks the first report of the chromosome end under-replication problem [1]. b) Examples of structural organization of domesticated eukaryotic RTs. Bacterial retrons are included for comparison. The centrally positioned RT catalytic core is represented by the seven conserved motifs separated by spacers of variable length, with distinctively long insertion loops 2a and 3a (also called IFD) marked in red. The D..DD active site residues and their non-catalytic replacements are indicated. Additional domains on either side of the RT core and thumb are as follows: TEN, telomerase essential N-terminal domain; TRBD, telomerase RNA binding domain; CTE, C-terminal extension; P, polyproline stretch; NLS, nuclear localization signal; Bromo, bromodomain; PROCN, PRO8 central domain; Endo, endonuclease-like; Jab1/MPN, putative deubiquitinase-like domain. The scale is approximate. Domain composition is compiled from refs. [55, 57, 59].

EVOLUTION OF APPROACHES TO RETROELEMENT DISCOVERY

Since their discovery in retroviruses, RT diversity underwent an amazing expansion from purely viral constituents to a staggering variety of structural and functional roles in eukaryotic and prokaryotic hosts (Fig. 1a). After early advances in the field of virology, which led to further discovery of reverse transcription in the replicative cycles of hepadnaviruses and caulimoviruses (collectively named pararetroviruses [9]) and were facilitated by the availability of methods for virus isolation and biochemical RT assays, the discovery potential soon shifted towards detection of sequence homologies, spurred by the advent of sequencing technologies and the landmark identification of common amino acid sequence motifs in the catalytic core of DNA polymerases from reverse-transcribing viruses [10]. Since then, the search for the aspartates forming the D..DD catalytic triad at the RT active site has quickly become an integral part of identification of novel RTs. In the RT discovery timeline (Fig. 1a), the underlying publications in which the characteristic RT residues were first identified were given priority in comparison to those reporting initial biochemical detection of RNA-dependent DNA polymerization. This is because proper experimental validation of RT activity should inevitably include site-directed mutagenesis of the active site residues, present in two of the seven conserved motifs defining the RT catalytic core (Fig. 1b).

The first half of the timeline, prior to the 1990s, is represented mainly by RTs from various types of viruses and mobile genetic elements. Indeed, multicopy transposable elements were one of the first components of eukaryotic genomes to be cloned molecularly [11, 12], along with other actively transcribed multicopy genes such as ribosomal DNA repeat units or histone gene clusters [13, 14]. The overall structural similarity between LTR-retrotransposons and retroviruses immediately became apparent upon their cloning from Drosophila and yeast [15]. However, the definitive proof of their close relationship to retroviruses came from analysis of their complete nucleotide sequences identifying the coding capacity for the RT enzyme [16, 17]. Furthermore, characteristic blocks of homology to the RT conserved motifs were soon identified not only in retrovirus-like transposable elements, but also in fungal mitochondrial group II mobile introns and other types of multicopy eukaryotic transposons, such as DIRS and LINE-like retrotransposons [18-21]. To conclude the first two decades of RT research, the existence of RTs in bacteria was reported in the form of retrons, multicopy extrachromosomal DNA–RNA chimeric molecules connected through a 2′-5′ branchpoint [22, 23].

The next temporal phase in RT discovery, while also relying on detection of sequence homologies, was dominated by RTs present in lower copy numbers, most of which do not belong to transposable elements, but instead represent single-copy host genes (Fig. 1a, underlined). In fact, the currently known eukaryotic retrotransposon diversity has not expanded since the discovery of Penelope-like retroelements (PLEs) [24]. The first and most prominent case of RT domestication in eukaryotes emerged with the proof that telomerase represents a bona fide RT. Connecting the RT activity with the corresponding enzyme took considerable time and effort, with mis-identifications along the way [25], but the ultimate success in identifying the telomerase catalytic subunit as an RT came with identification of the conserved motifs in the fingers and palm RT domains, validated by loss of activity upon site-directed mutagenesis of the three invariant catalytic aspartates [26]. Thus, a single-copy RT gene present in nearly all eukaryotic species was found to be responsible for an essential host function of elongating the ends of linear chromosomes to counteract terminal DNA loss from under-replication, or marginotomy, as it was originally named by Olovnikov [27]. Currently, new RT types are mostly identified by computational mining, taking advantage of the abundant genomic and metagenomic data. In the following sections, our aim is to briefly characterize the RTs that belong to mobile genetic elements, and to compare them with those that are domesticated and accordingly non-mobile.

EUKARYOTIC MOBILE ELEMENTS: RETROVIRUSES, PARARETROVIRUSES, RETROTRANSPOSONS

To understand and compare the properties of viral and mobile RTs, we need to consider the architectural composition of conserved domains that occur in combination with RT, as well as the adjacent gene content within the mobilizable unit (Fig. 2). Interestingly, retroviruses, the discovery of which opened the era of RT research, turned out to be strikingly similar to LTR-retrotransposons, discovered over a decade later, in their gene content, organization, and replication cycle, pointing at their common evolutionary ancestry [16, 17, 28]. RTs of hepadnaviruses can be broadly assigned to the base of the viral/LTR branch of eukaryotic RTs, which harbors the C-terminal RNase H domain to ensure replication in the cytoplasm, avoiding the need to employ host nuclear RNase H enzymes for destruction of RNA in the DNA–RNA hybrid (Fig. 2). Even more unusual is the case of caulimoviruses, the RT of which is closely related to that of Metaviridae (aka Ty3/mdg4(gypsy)-like LTR retrotransposons), such that their ancestry is most likely of hybrid nature, resulting from RT capture by a DNA virus [29]. The Ty1/copia-like LTR retrotransposons (Pseudoviridae) conform to the general LTR structure, but show a different domain order. All retrovirus-like elements comprising the taxonomic order Ortervirales (Retroviridae, Metaviridae, Pseudoviridae and Belpaoviridae) [29] are mobilized with the aid of the integrase (IN), which is responsible for insertion of a cDNA copy into new chromosomal locations. A distinct group called DIRS elements mobilizes by using tyrosine recombinase (YR) instead of IN.

Fig. 2.

Domain architecture of the major RT types described in the text. For each type, a typical architecture is presented as revealed by the CDART tool at NCBI [63]. Domain designation is according to the NCBI conserved domain database (CDD) [64]. The colors are assigned by the CDART tool dynamically rather than following each domain specifically; to facilitate homology tracing, the RT and RNaseH (RH) domains are connected with a dashed line. The circular arrangement follows the phylogenetic groupings in the center from ref. [55], with letters P, V, T, and L corresponding to prokaryotic, virus-like, telomerase-like, and LINE-like retroelements; RVT genes form a separate group which has no designation yet. Mobile elements contain six different types of associated nucleases/phosphotransferases mentioned in the text: IN, AP, REL, YR, GIY-YIG, HNH. Virus-like elements are named according to ICTV classification [29]. Domesticated eukaryotic RTs (TERT, RVT) are designated as Genes.

Non-LTR (or LINE-like) retrotransposons mobilize without producing a cytoplasmic cDNA intermediate: their RT uses the target-primed reverse transcription (TPRT) mechanism to synthesize cDNA directly at the chromosomal integration site nicked by one of the two different types of associated endonuclease (EN), either AP-like or REL-like. Finally, RTs of Penelope-like elements employ yet another EN type (GIY-YIG) for mobilization, bringing the number of retrotransposon-associated endonuclease types to five. A more detailed recent description of retromobility mechanisms can be found in [30].

PROKARYOTIC MOBILE ELEMENTS: GROUP II INTRONS, RETROPLASMIDS

Group II introns (G2I) are self-splicing retroelements found in bacteria, some archaea, and eukaryotic organelles [31]. First discovered in fungal mitochondria, they were shown to possess the same structural organization in bacteria and archaea, and are widely regarded as evolutionary precursors to eukaryotic spliceosomal introns. Their retromobility is ensured by the combined action of the catalytically active RNA, which functions as a ribozyme in the self-splicing and reverse-splicing reactions, and the intron-encoded RT, which synthesizes a cDNA copy of the intron RNA at the target site, using the TPRT mechanism.

Retroplasmids were found in fungal mitochondria [32] and for a long time served as a model system to study the unconventional priming modes by reverse transcriptases (protein priming, when RT uses the hydroxyl group of tyrosine or serine residues for priming, or de novo RT initiation, which does not use any primer at all). Their distribution is still quite limited, as there are only a few dozen fungal species harboring them, out of hundreds of sequenced fungal genomes. As extrachromosomal entities, they are not expected to undergo integration, but technically form part of the mobilome due to their ability to replicate autonomously.

NON-MOBILE RETROELEMENTS IN BACTERIA AND ARCHAEA: RETRONS, DGRs, Abi/UG, Cas-ASSOCIATED, G2I-LIKE

Retrons are peculiar domesticated bacterial elements composed of covalently linked RNA and multicopy single-stranded DNA (msDNA) in a single branched molecule connected by a 2′-5′ phosphodiester linkage [22, 23]. Each retron module encodes an RT protein sequence, a non-coding RNA which is reverse-transcribed by the RT to form the chimeric single-stranded DNA/RNA molecules, and an effector gene needed for anti-phage activity. Although retrons were the first prokaryotic non-mobile retroelements to be discovered, over 30 years ago, their cellular function was elucidated only in 2020 [33-35]. Retrons confer host defense against a broad range of phages via abortive infection and subsequent cell death. They are widespread in bacteria, being one of the main components of bacterial immune systems. However, the exact mechanisms by which they confer phage resistance via reverse transcription are still unknown. The co-occurrence of RT in tripartite modules with template RNA and a variety of putative effector genes suggests their direct interaction in eliciting anti-phage response [36]. Indeed, such interaction was observed in a complex between RT, its cognate msDNA, and the linked effector nucleoside deoxyribosyltransferase [37].

Diversity-generating retroelements (DGRs) are non-mobile RTs that diversify adjacent target DNA sequences in bacteria, archaea, and viruses [38, 39]. Despite being non-essential retroelements, DGRs are nevertheless beneficial for their hosts. In the best-described model system, DGRs generate diversity in the C-terminal variable region of the target protein gene (mtd) of the Bordetella pertussis bacteriophage BPP-1. The resulting hypervariability in the phage tail protein, the region that contacts the bacterial cell during infection, allows the phage to infect bacterial cells with altered surface receptors. By utilizing error-prone reverse transcription, DGRs help to increase diversity in gene products, especially those involved in ligand-binding and host attachment. It is still a mystery how the adenine specificity of targeted hypermutagenesis is accomplished. Moreover, inspection of adjacent genes in DGR modules suggests that hypervariability targets may not be limited to tropism switching and surface display [40, 41].
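The outcome of adenine-specific hypermutagenesis can be pictured with a minimal simulation. The sketch below copies a template sequence while replacing adenines, and only adenines, with a randomly chosen nucleotide at a given frequency. The function name, the substitution rate, and the toy template are hypothetical and chosen purely for illustration; the code models the observed pattern of variation, not the unresolved enzymatic mechanism behind it.

```python
import random


def mutagenize_adenines(template: str, rate: float, rng: random.Random) -> str:
    """Copy `template`, randomizing each adenine with probability `rate`.

    Non-adenine positions are always copied faithfully, which is the
    defining feature of the adenine-specific pattern.
    """
    copied = []
    for base in template.upper():
        if base == "A" and rng.random() < rate:
            # Any nucleotide may be incorporated, including A itself.
            copied.append(rng.choice("ACGT"))
        else:
            copied.append(base)
    return "".join(copied)


if __name__ == "__main__":
    rng = random.Random(42)          # fixed seed for reproducibility
    template = "AACTGACGATACAAGT"    # hypothetical toy template
    for _ in range(3):
        print(mutagenize_adenines(template, rate=0.5, rng=rng))
```

Every variant produced this way differs from the template only at positions that were adenines, so a short template with a handful of adenines can already give rise to a large repertoire of target sequences.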

Abortive infection systems (Abi), represented by AbiA, AbiK, and Abi-P2, are bacterial retroelements that serve to protect certain bacteria from phage infections. These genes are found only in the genomes of some Bacilli (mostly Lactococcus lactis), where they are plasmid-encoded (AbiA and AbiK), and on P2-like prophages in Escherichia coli (Abi-P2). While their detailed mechanism of action is still unknown, Abi proteins are required for blocking phage replication followed by programmed cell death or phage exclusion [42, 43]. Interestingly, the AbiK protein was shown to perform non-templated DNA polymerization in vitro and is covalently attached to DNA, which is indicative of protein priming [44]. Thus, Abi systems represent another type of active RT, besides retrons, that confers an advantage to a subset of bacteria when attacked by phages. Of note, Abi-P2 and AbiK RTs are exceptional in forming compact trimers or hexamers in solution, as well as in lacking the RT thumb domain, which is replaced by an all-helical domain composed of HEAT repeats [45, 46]. A substantial proportion of the so-called unknown groups (UG) [47], some of which were independently called DRT (defense RT) [33], were reported in earlier surveys as unassignable to a specific RT type, but were later found to be related to Abi RTs and to play a role in antiphage defense, with enrichment in the so-called defense islands, which contain a variety of other genes providing protection against invading foreign DNA [33, 45].

RT-Cas: RT domains were found near CRISPR-associated genes or even fused to Cas proteins [48-50]. Potentially, these RTs can confer bacterial immunity by performing cDNA synthesis on RNA from bacteriophages, and were indeed shown to mediate heritable acquisition of short sequence segments (spacers) from foreign RNA elements [51]. Fusion to Cas proteins is not necessary, although it allows more efficient cooperation of the interacting domains [52]. These RTs are not monophyletic, having been co-opted into CRISPR-Cas systems from several bacterial RT lineages [50].

Group II intron-like RTs (G2L), a heterogeneous group of non-mobile RTs that share sequence similarity with G2I but lack the ribozyme moiety, were first described in [48]. Recently, it was found that the G2L RT from Pseudomonas aeruginosa (G2L4 RT) is involved in translesion DNA synthesis and double-strand break repair via microhomology-mediated end-joining (MMEJ) [53]. Interestingly, the substitution of YADD to YIDD in the G2L4 RT active site is responsible for a shift towards performing MMEJ instead of primer extension, which is characteristic of canonical G2I RTs with YADD at the catalytic site. Nevertheless, a canonical G2I RT was also capable of performing DNA repair.

NON-MOBILE EUKARYOTIC RTs AND THEIR DERIVATIVES: TELOMERASE, RVT, PRP8

Telomerase reverse transcriptase (TERT) (Fig. 1b), as described above, is undoubtedly the best-known RT with a crucial cellular function. Owing to its main function of maintaining the length of linear chromosomes, it has well-described roles in aging, cancer, and other human diseases (aplastic anemia, cri du chat syndrome, dyskeratosis congenita, etc.). Multiple approaches are being developed to target active telomerase and the associated TERT RNA template pharmaceutically in the context of anti-cancer therapy and age-related diseases (recently compiled in [54]).

Reverse transcriptase-related genes (rvt) (Fig. 1b) are the most recently discovered type of domesticated eukaryotic RTs, widespread in fungi and sporadically occurring in selected plants, protists, and invertebrates [55]. Strikingly, these genes are present in both prokaryotes and eukaryotes, in contrast to all other RT types. Notably, RVTs from all bacterial phyla form a monophyletic group, suggesting that they were not horizontally transferred from eukaryotes as initially thought, but may have been present in Bacteria prior to eukaryogenesis [56]. The rvt genes encode active RT-like proteins that in fungi can polymerize both dNTPs and NTPs. RVT proteins are also capable of protein priming. While the biological function of rvt genes is not yet fully understood, they are clearly preserved by natural selection, indicating their importance for host cells. In fungi, these genes are strongly activated by starvation and certain antibiotics, suggesting their involvement in the response to these agents [55].

Pre-mRNA-processing factor 8 (Prp8) is an unusual domesticated RT derivative that lost two out of three catalytic aspartates, thereby losing the ability to polymerize nucleotides [57]. Yet, Prp8 is an essential part of the eukaryotic spliceosome, regulating its assembly and conformation during pre-mRNA splicing [58]. The RT moiety of Prp8 was proposed to originate from mobile group II introns [59], giving us one more example of how, during evolution, selfish retroelements can give rise to essential components of eukaryotic cells, in this case as a structural element which comprises the central U5-snRNA-binding part of a large multi-domain protein (Fig. 1b). The lack of catalytic residues and the very high sequence conservation due to evolutionary constraints imposed by spliceosome function impede unambiguous phylogenetic placement of this RT-derived domain, but its origin undoubtedly dates back to the last common ancestor of all eukaryotes.

CONCLUDING REMARKS

From the RT descriptions summarized above, it is easy to note that the RT types discovered in earlier years generally originated from abundant, high-copy-number sources – initially from viruses, and subsequently from cellular multicopy mobile genetic elements: from LTR, DIRS, and non-LTR retrotransposons in eukaryotes, to prokaryotic mobile group II introns and retroplasmids, and to retrons producing abundant branched DNA–RNA molecules in bacterial cells. Retromobility is typically conferred by a specific type of endonuclease associated with each mobile element, providing the means for intrachromosomal insertion of a cDNA copy. At the initial stages, many eukaryotic TEs were identified by their ability to cause insertional mutations with visible phenotypes in strains experiencing transposition of multicopy elements [60]. It is now clear that RTs can perform a large variety of functions besides their role in proliferation of selfish genetic elements. We argue that the diversity of domesticated RTs has been grossly underestimated and their role has been substantially undervalued, with plenty of opportunities existing for RT recruitment by the host cells despite their overall non-essential nature and patchy distribution. It is not surprising that sometimes it may take a long time, even decades, from initial identification of an element to the proper assignment of a host function, if the selective advantage to the host is conditional. The telomerase RT, a single-copy gene, represents a notable exception in being ubiquitously present throughout eukaryotes, and the revelation that it encodes a specialized RT, i.e., an enzyme previously thought to be characteristic only of viruses and mobile elements, has truly revolutionized the field [26]. Still, even the critical function of telomere maintenance can be supported by independent backup pathways [61].

It is worth emphasizing that RT domestication in eukaryotes is invariably associated with the appearance of additional functional domains that would prevent it from spurious cDNA synthesis using random primer/template combinations. Generally, synthesis of cDNA copies on random host RNA templates is not expected to benefit the host cell and should be prevented. The most straightforward way is to eliminate catalytic activity by replacing active site residues, as in Prp8. Another option is to change the configuration of the active site by inserting additional structural loops, as in RVT genes. Finally, TERTs have achieved strict substrate specificity via a high degree of specialization towards an unlinked highly structured RNA (called TER or TR), which contains a short reverse-complement of the telomeric repeat unit serving as a template, and interacts specifically with the TRBD domain to perform highly processive DNA synthesis by target-primed reverse transcription (TPRT) off the 3′-ends of exposed short G-rich tandem repeats at the ends of linear chromosomes [62]. It is fascinating to realize that the specialized enzyme predicted to overcome terminal DNA loss and to preserve chromosome integrity takes its origins from mobile elements initially poised to disrupt chromosomal stability.
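The template relationship described above, in which the telomerase RNA carries a short reverse-complement of the telomeric repeat, can be checked with a few lines of code. The sketch below is a toy model under stated assumptions: the 11-nt RNA string is used as an illustrative stand-in for the template region of human telomerase RNA, the function names are hypothetical, and repeat addition is reduced to simple string concatenation without modelling translocation, RNA structure, or protein domains.

```python
# Base-pairing rules for copying an RNA template into DNA.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna_template: str) -> str:
    """Return the DNA strand (5'->3') copied from an RNA template (5'->3')."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(rna_template.upper()))


def extend_3prime_end(dna_end: str, repeat: str, cycles: int) -> str:
    """Append `cycles` copies of `repeat` to the 3' end of a G-rich overhang."""
    return dna_end + repeat * cycles


if __name__ == "__main__":
    # Assumed template region (illustrative); telomeric repeat unit TTAGGG.
    template_rna = "CUAACCCUAAC"
    copied_dna = reverse_transcribe(template_rna)
    print(copied_dna)                       # DNA complementary to the template
    print("TTAGGG" in copied_dna)           # the repeat unit is encoded by it

    # Toy chromosome end extended by three repeat-addition cycles.
    print(extend_3prime_end("GGTTAG", "GGTTAG", cycles=3))
```

Because the template specifies slightly more than one repeat unit, the newly synthesized DNA can re-align with the start of the template after each round, which is the feature that makes iterative repeat addition possible.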