Mechanism of Alu integration into the human genome
LINE-1 or L1 has driven the generation of at least 10% of the human genome by mobilising Alu sequences. Although there is no doubt that Alu insertion is initiated by L1-dependent target site-primed reverse transcription, the mechanism by which the newly synthesised 3′ end of a given Alu cDNA attaches to the target genomic DNA is less well understood. Intrigued by observations made on 28 pathological simple Alu insertions, we have sought to ascertain whether microhomologies could have played a role in the integration of shorter Alu sequences into the human genome. A meta-analysis of the 1624 Alu insertion polymorphisms deposited in the Database of Retrotransposon Insertion Polymorphisms in Humans (dbRIP), when considered together with a re-evaluation of the mechanism underlying how the three previously annotated large deletion-associated short pathological Alu inserts were generated, enabled us to present a unifying model for Alu insertion into the human genome. Since Alu elements are comparatively short, L1 RT is usually able to complete nascent Alu cDNA strand synthesis leading to the generation of full-length Alu inserts. However, the synthesis of the nascent Alu cDNA strand may be terminated prematurely if its 3′ end anneals to the 3′ terminal of the top strand’s 5′ overhang by means of microhomology-mediated mispairing, an event which would often lead to the formation of significantly truncated Alu inserts. Furthermore, the nascent Alu cDNA strand may be ‘hijacked’ to patch existing double strand breaks located in the top-strand’s upstream regions, leading to the generation of large genomic deletions.
Keywords: Alu insertion polymorphisms; Human genetic disease; Human genome evolution; L1; LINE-1; Retrotransposition
dbRIP: Database of Retrotransposon Insertion Polymorphisms in Humans
LINE-1 or L1: Long interspersed element-1
TPRT: Target site-primed reverse transcription
TSDs: Target site duplications
LINE-1 (long interspersed element-1) or L1-mediated retrotransposition has significantly impacted upon human genome evolution (for recent reviews, see Deininger et al. 2003; Kazazian 2004; Han and Boeke 2005; Hedges and Batzer 2005) but has also given rise to human genetic disease (Chen et al. 2005, 2006). Intriguingly, L1 elements have driven the generation of some 10% of the human genome mass by mobilising Alu sequences (Lander et al. 2001; Batzer and Deininger 2002). Although there is no doubt that Alu insertion is initiated by L1 endonuclease and reverse transcriptase (RT)-dependent target site-primed reverse transcription (TPRT; Dewannieux et al. 2003; Hagan et al. 2003), the mechanism by which the newly synthesised 3′ end of a given Alu cDNA attaches to the target genomic DNA is less well understood. In this regard, the integration of full-length L1 elements has recently been proposed to occur via a template-jumping model whereas the integration of 5′-truncated L1 elements is thought to result predominantly from a microhomology-mediated end-joining (MMEJ) model (Zingler et al. 2005; Babushok et al. 2006). The integration of full-length Alu elements can also be explained, at least in principle, by the template-jumping model. However, unlike 5′-truncated L1 elements, 5′-truncated Alu elements appear by and large not to be integrated via the MMEJ model (Zingler et al. 2005).
Identification of microhomology existing between the top strand’s 5′ overhang and the sequence that lies 5′ to the truncation position in the Alu consensus sequence
The sub-family of each selected Alu insert was checked/annotated using RepeatMasker (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker; as of December 6, 2006). Although in some cases the annotations differed from those previously reported in Chen et al. (2005, 2006) and dbRIP, this did not affect the conclusions of the study in any way. Consensus sequences of the AluYa5, AluYa8, AluYb8, AluYb9, AluY, AluSq, AluYg6, AluYd8 and AluSp sub-families were taken from Repbase (http://www.girinst.org/repbase/update/browse.php; Jurka et al. 2005). Sequence alignments were performed with ClustalW (http://www.ebi.ac.uk/clustalw/#).
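The kind of microhomology search named in the heading above can be sketched computationally. The Python snippet below is a minimal illustration under stated assumptions, not the procedure used in this study: it counts how many bases at the 3′ end of the genomic sequence flanking the 5′ junction of an insert match the bases lying immediately 5′ to the truncation position in a consensus sequence. The function name, the 1-based `start_pos` convention, the default cap of 7 bp and the toy sequences are all hypothetical choices made for illustration.

```python
def microhomology_length(upstream_flank: str, consensus: str,
                         start_pos: int, max_len: int = 7) -> int:
    """Count junction bases shared by the flank and the consensus.

    upstream_flank: genomic sequence ending at the 5' junction of the insert.
    consensus:      sub-family consensus sequence (5' to 3').
    start_pos:      1-based consensus position at which the insert starts.
    max_len:        upper bound on the reported microhomology (assumed 7 bp).
    """
    if not 1 <= start_pos <= len(consensus) + 1:
        raise ValueError("start_pos lies outside the consensus sequence")

    flank = upstream_flank.upper()
    # Consensus bases that are absent from the insert (5' of the truncation).
    skipped = consensus[: start_pos - 1].upper()

    n = 0
    # Walk leftwards from the junction while the two sequences agree.
    while (n < max_len and n < len(flank) and n < len(skipped)
           and flank[-1 - n] == skipped[-1 - n]):
        n += 1
    return n


# Toy example (sequences are invented, not real Alu or genomic sequence).
toy_consensus = "ACGTTGCAAGCT"
toy_flank = "TTTATG"
print(microhomology_length(toy_flank, toy_consensus, start_pos=7))
```

A full-length insert (`start_pos=1`) has no consensus sequence 5′ to its start and therefore always returns 0 under this definition.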
A trimodal length distribution of simple Alu inserts and the role of microhomology in generating shorter Alu inserts
Studies of recently inserted genomic L1 elements in the human genome (Myers et al. 2002; Pavlicek et al. 2002; Szak et al. 2002; Boissinot et al. 2004), pathological L1 direct insertions (Chen et al. 2005), and de novo L1 insertions in cultured human cells (Gilbert et al. 2002, 2005) as well as in a transgenic mouse model (Babushok et al. 2006) have consistently shown that simple L1 inserts display a bimodal length distribution with a large peak of short (<2 kb) and a smaller peak of longer (∼6 kb) integrations. Although the exact mechanism underlying this bimodal distribution remains controversial (e.g. Farley et al. 2004; Gilbert et al. 2005), the generation of the abundant short L1 inserts would appear to be facilitated by the presence of microhomologies frequently found between the top strand’s 5′ overhang in the target genomic sequence and the 3′ end of the nascent L1 RT-transcribed cDNA strand (Zingler et al. 2005; Babushok et al. 2006).
Table 1 Correlation between the presence of microhomology (1–7 bp) and the length of the 5′ truncation of Alu insertion polymorphisms
Column headings: Number of entries manifesting microhomology (A); Total number of entries (B)
Cell values, in order: 23 (1 bp), 17 (≥2 bp); 10 (1 bp), 5 (≥2 bp); 17 (1 bp), 12 (≥2 bp)
As mentioned above, only 34.8% of the Group II Alu inserts were found to exhibit microhomology. By contrast, microhomology was found in some 50% (44/89) of the Group III Alu inserts. As a matter of fact, in the context of the 5′ truncated Alu insertion polymorphisms (i.e. starting positions, 8–271), there exists a positive correlation between the presence of microhomology and the length of the 5′ truncation (Table 1), thereby suggesting an important role of the MMEJ mechanism in generating shorter Alu inserts. Under this model, the generation of most of the shorter Alu inserts could have been promoted by the inadvertent annealing of the microhomology present between the 3′ end of the nascent Alu cDNA strand and the 3′ end of the top strand’s 5′ overhang. This would then be followed by the premature termination of nascent cDNA strand synthesis with concomitant initiation of second Alu cDNA strand synthesis by either a second L1 RT or a host DNA repair enzyme. In addition, we should point out that our finding differs from the recent genome-wide analysis that has concluded that 5′ truncated Alu elements exhibit no (or only a weak) tendency to exhibit microhomology (Zingler et al. 2005). The discrepancy may be due to one or more of the following reasons. Firstly, Zingler et al. (2005) did not address the microhomology issue in relation to the different lengths of 5′ truncation. Secondly, these authors used only computer-generated data with respect to the analysis of the 5′ truncated Alu insertions. In other words, they did not analyse the relevant data manually. As shown in Supplementary Tables S3–S6, our manual evaluation led to the re-annotation of a significant fraction of the dbRIP entries.
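The comparison underlying this correlation, namely the fraction of entries manifesting microhomology within each class of 5′ truncation, can be illustrated with a short tabulation. The Python sketch below assumes that each entry has already been reduced to a pair (starting position in the consensus, microhomology length in bp); the function name, the bin boundaries and the toy entries are hypothetical and do not reproduce the dbRIP data or the group boundaries used in this study.

```python
from typing import List, Optional, Tuple

Entry = Tuple[int, int]   # (starting position, microhomology length in bp)
Bin = Tuple[int, int]     # inclusive range of starting positions


def microhomology_fraction_by_bin(
    entries: List[Entry], bins: List[Bin]
) -> List[Tuple[Bin, int, int, Optional[float]]]:
    """For each bin return (bin, A, B, A/B).

    A = number of entries manifesting microhomology (>= 1 bp),
    B = total number of entries whose starting position falls in the bin.
    A/B is None for an empty bin.
    """
    rows = []
    for low, high in bins:
        lengths = [mh for pos, mh in entries if low <= pos <= high]
        a = sum(1 for mh in lengths if mh >= 1)
        b = len(lengths)
        rows.append(((low, high), a, b, a / b if b else None))
    return rows


# Toy entries and arbitrary bins spanning starting positions 8-271.
toy_entries = [(10, 0), (20, 1), (30, 0), (150, 2), (200, 1), (250, 0)]
for row in microhomology_fraction_by_bin(toy_entries, [(8, 100), (101, 271)]):
    print(row)
```

Applied to the figures quoted above for the Group III inserts, the same ratio gives 44/89 ≈ 49.4%, i.e. the "some 50%" stated in the text.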
Table: Near full-length Alu insertion polymorphisms (i.e. starting positions 2–5 in accordance with their respective consensus sequences) that can be alternatively interpreted as full-length insertions
Column headings: Number of entries that can be alternatively interpreted as full-length insertions; Total number of entries
Large deletion-associated short Alu inserts appear to be integrated through qualitatively different mechanisms
The generation of the three disease-causing large genomic deletions associated with Alu insertions can in principle be accounted for by the model illustrated in Fig. 6B from Gilbert et al. (2002): each event was putatively initiated by L1 endonuclease cleavage on the bottom strand but, unlike the typical process of TPRT leading to the generation of a simple insertional event, the L1 RT-transcribed Alu cDNA strand appears to have invaded a double strand break located far upstream of the bottom strand nick/break (Chen et al. 2005). This model can be further refined in the light of new developments in the field. Thus, in a genome-wide analysis of both human and chimpanzee data sets, Han et al. (2005) observed a significant positive correlation between the size of the L1 direct insertion and the size of the associated deletions. Han et al. (2005) surmised that the longer the newly synthesised L1 cDNA strand was, the higher would be the probability of forming sufficient complementarity between the end of the L1 cDNA and the region flanking the 5′ end of the L1 insertion in the ancestral sequence. This is indeed a plausible explanation for the generation of large genomic deletions created upon L1 insertion. This model cannot however be readily extrapolated to cases of large genomic deletions caused by insertions of Alu elements, simply because the Alu inserts in the three disease-causing events are significantly 5′ truncated (see Fig. 1). This notwithstanding, the model of Han et al. (2005) stimulated us to propose a refined model for the generation of large genomic deletions caused by Alu insertions: the significant sequence similarity existing between the regions spanning the top strand’s upstream deletion breakpoints and the newly synthesised Alu cDNA strands in all three cases (Fig. 4) suggests that the longer the stretch of complementarity, the higher the likelihood of a newly synthesised Alu cDNA strand annealing to a double strand break-containing far-upstream region. In this refined model, the position of the Alu truncation would be specified by the position of the double strand break in the top strand whereas the synthesis of the Alu cDNA strand might not necessarily need to be completed in order to obtain sufficient complementarity for strand annealing/invasion.
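The notion of a "stretch of complementarity" in the refined model can be made concrete with a simple measure. The Python sketch below is purely illustrative and was not part of the analysis reported here: it returns the length of the longest uninterrupted run of bases in a cDNA sequence that could pair, in antiparallel orientation, with a contiguous run in a target strand. The function names, the restriction to exact Watson-Crick pairing over the alphabet A/C/G/T, and the toy sequences are assumptions; which genomic strand to pass as `target` is left to the user.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_complementary_stretch(cdna: str, target: str) -> int:
    """Length of the longest gap-free stretch of `cdna` able to pair with `target`.

    Implemented as a longest-common-substring search between the reverse
    complement of `cdna` and `target` (dynamic programming, O(len * len)).
    """
    rc = reverse_complement(cdna)
    tgt = target.upper()
    best = 0
    prev = [0] * (len(tgt) + 1)
    for i in range(1, len(rc) + 1):
        cur = [0] * (len(tgt) + 1)
        for j in range(1, len(tgt) + 1):
            if rc[i - 1] == tgt[j - 1]:
                # Extend the match that ended at the previous base of both.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


# Toy sequences (invented for illustration only).
print(longest_complementary_stretch("AAGGCC", "TTTGGCCTAAA"))
```

Exact matching is the simplest possible choice; the sequence similarity discussed in the text tolerates mismatches, so a real analysis would more plausibly rely on an alignment tool than on this measure.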
One further point warrants discussion. It is possible that the top strand’s upstream double strand break may be attributable to the activity of L1 endonuclease (Gasior et al. 2006). Were this to be the case, it would predict an active role for L1-mediated retrotransposition in creating large genomic deletions. It should however be emphasised that the L1 endonuclease used to generate the top strand’s upstream double strand break may not necessarily be the same as that used to create the bottom strand’s first nick (Mine et al. 2007), by analogy to the proposition that two different L1 RT molecules may be used for twin-priming, leading to L1 inversion (Ostertag and Kazazian 2001b). It is equally possible that the top strand’s upstream double strand break was created independently of L1 endonuclease. Were this to be the case, “a fascinating scenario would present itself: the organism could have ‘hijacked’ the L1 machinery to repair an existing double strand break through a mechanism akin to single strand annealing.” (Chen et al. 2005). In this particular context, L1 integration may represent a ‘host/parasite battleground’ as it has been termed by Gilbert et al. (2005), in which L1 integration finds itself in a ‘race’ to complete cDNA synthesis before being ‘hijacked’ to patch an upstream double strand break.
A unified model for Alu insertion into the human genome
Based upon the above observations, we propose a unified model for Alu insertion in the human genome. Since Alu elements are comparatively short, L1 RT is usually able to complete nascent Alu cDNA strand synthesis before jumping to the 3′ end of the top strand’s 5′ overhang, resulting in the generation of either full-length (i.e. Group I events) or 5′ truncated (i.e. Group II events) Alu inserts. Alternatively, the synthesis of the nascent Alu cDNA strand may be terminated prematurely if its 3′ end anneals to the 3′ terminal of the top strand’s 5′ overhang by means of microhomology-mediated mispairing, an event which would often lead to the formation of significantly truncated (Group III) Alu inserts. Furthermore, the nascent Alu cDNA strand may be ‘hijacked’ to patch existing double strand breaks located in the top strand’s upstream regions (which should usually comprise Alu-rich sequences), leading to the generation of large genomic deletions. Clearly, the unified model proposed here is likely to be subject to further modification/revision by new studies as they emerge.
This work was supported by the INSERM (Institut National de la Santé et de la Recherche Médicale), France.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, Szustakowki J, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
- Szak ST, Pickeral OK, Makalowski W, Boguski MS, Landsman D, Boeke JD (2002) Molecular archeology of L1 insertions in the human genome. Genome Biol 3(10):research0052
- Zingler N, Willhoeft U, Brose HP, Schoder V, Jahns T, Hanschmann KM, Morrish TA, Lower J, Schumann GG (2005) Analysis of 5′ junctions of human LINE-1 and Alu retrotransposons suggests an alternative model for 5′-end attachment requiring microhomology-mediated end-joining. Genome Res 15:780–789