Introduction

Elements such as viruses and transposons, through evolution with their host organisms, have acquired the ability to integrate into host genomes and ultimately shuffle genetic material between organisms. These elements have an established history in molecular biology and genetics research because of their ability to deliver specific genetic cargo, randomly disrupt host genomes for genetic screens, and serve as vectors for delivery of therapeutic expression cassettes to treat human disease. Viral vectors have been the predominant tools for these applications for three reasons: the ease and efficiency with which specific viral genetic cassettes can be introduced into cells; the vast accumulated knowledge of viruses and their mechanisms of gene transfer into chromosomes; and the large number of sites in genomes into which they can integrate. Retroviruses in particular have been used for random insertion into chromatin to interrupt host genes (insertional mutagenesis) and thereby identify their function [1–3] as well as for delivery of therapeutic genes [4–6]. Moreover, viral activation of oncogenes and, more recently, inactivation of tumor suppressors have been used to discover several novel genes that are involved in cancer progression [7–12]. The consequence of insertional activation of host cell oncogenes by viral vectors, however, has emerged as a major risk/obstacle in gene therapy, with a few cases of leukemia arising from oncogene activation by therapeutic vectors [13, 14]. The potential genetic consequences of insertions of integrating vectors are summarized in Figure 1.

Figure 1

Potential genetic consequences of integration of transgenic cassettes into chromatin. An expression cassette (orange box) in a viral or nonviral vector (represented by purple inverted arrowheads, which indicate either inverted or direct terminal repeats) can integrate into four classes of chromatin. (1) Integration into heterochromatin will most likely result in the suppression of expression of the transgene and essentially no genetic consequences for the host. (2) Integration into intergenic regions of euchromatin is the most desirable outcome; the transgenic cassette is expressed, leading to a gain of function (GOF) in the host cell. (3) Integration into a transcriptional regulatory region can have several outcomes including expression (GOF) of the transgenic cassette, potentially modified by neighboring enhancer and silencer elements in the region. Regulatory elements in the transgenic cassette may either enhance expression of the neighboring gene (GOF for gene X) or, in rare cases, block expression of an active gene. (4) Integration of the vector into a transcriptional unit may allow expression of the transgene but block expression of the host gene leading to a phenotypic loss of function (LOF). Integration within some genes can also lead to a dominant gain of function (DGF) or production of a dominant-negative form (DNF) of the original gene X. A further discussion of effects of insertional mutagenesis can be found in the reports by Carlson and Largaespada [61] and Collier and Largaespada [154].

Risk of oncogene activation in gene therapy

Activation of oncogenes in mice by insertionally mutagenic retroviruses suggested that inadvertent oncogene activation resulting from the use of relatively benign therapeutic vectors is a potential risk associated with gene therapy. Gene therapy vectors are extensively minimized to eliminate their replicative potential and reduce their collateral effects on the target genome [15]. However, extensive testing in animals demonstrated that the risk of oncogenic activation was real, although variable and dependent on the viral vector used, the genetic cargo, and the background genetics of the model system [16–22]. Given what was assumed to be acceptable risk, retroviral gene therapy trials have been conducted in human patients. Nearly 1,000 clinical gene therapy trials have been initiated, more than half with retroviral vectors [4], but as yet no vectors have been approved in the USA for clinical gene therapy outside the clinical trial setting [23]. (Gendicine, an adenovirus designed to restore p53 function in cancerous cells, has been approved for commercial human gene therapy in China [24], although this vector is essentially nonintegrating and thus carries decreased risk for oncogene activation via vector insertion.)

The worst fears of the gene therapy field, oncogene activation, were realized when three of more than 20 patients treated for X-linked severe combined immunodeficiency disease (X-SCID) developed leukemia. These adverse findings, including one death, occurred 3 years or more after administration of therapeutic murine leukemia virus (MLV)-derived retrovirus vectors [25, 26]. The linkage between treatment and leukemias could be inferred because the expanded transformed cell populations harbored clonal integrations of the therapeutic vector, which suggested a biologic selection for the retrovirus-induced mutation [27–30]. However, these studies also indicated that clonal expansions in some cases appeared to be temporary and did not always lead to adverse effects, features that could actually improve the likelihood of successful gene therapy. The cause of at least two of the leukemias appears to be insertion of the MLV vector close to the LMO2 oncogene, which led to LMO2's activation by enhancers in the long terminal repeat (LTR) sequences of the vector [31–33]. Retrospective examination of the role of LMO2 during development supported this conclusion [34, 35]. Subsequent studies in which the cargo gene IL2γc was over-expressed in mice (albeit at levels higher than in the X-SCID leukemia patients) suggested that this gene could itself act as an oncogene in T cells [36]. Also, simultaneous activation of IL2γc and LMO2 by oncogenic retroviruses had been observed in one mouse, suggesting a possible genetic interaction between the cargo IL2γc gene and LMO2 [33]. The relevance of these observations to clinical cases, however, is highly debatable [37, 38].

In contrast, other gene therapy trials that employed retroviral vectors to treat adenosine deaminase deficiency [39–41] and chronic granulomatous disease (CGD) [42] have not yet reported any equivalent adverse events. In the CGD study, there appeared to be powerful selection for integration events of the spleen focus-forming virus vector, which also was used as a vector for X-SCID [43], into the neighborhoods of three previously identified genes, namely MDS-EVI1, PRDM16, and SETBP1, which have been associated with enhanced proliferation following integration of retroviruses with activating LTRs [44–46]. As noted previously, findings of preferential integration around certain genes are not necessarily due to a preference for these genes, but may rather be a consequence of clonal expansion that can be transient and thereby beneficial in terms of enhancing the number of therapeutic cells. A similar effect has also been observed in nonhuman primate studies, indicating that this result may not be unique [19]. Despite the striking incidence of common integration sites that are often associated with tumor or leukemia formation [8, 47, 48], there has been no report of adverse events in the CGD patients and no indication that the corrective gene, gp91phox, synergizes with any of the three common integration site genes to promote growth. Likewise, a murine stem cell retrovirus has been used to deliver the α and β chains of the anti-MART-1 T-cell receptor complex ex vivo into peripheral blood lymphocytes to treat melanoma without any apparent adverse effects, although integration sites were not examined and the patient population had low odds for survival, even with the treatment (two out of 15 survived for more than 1 year) [49].

Taken together, the results of the CGD and X-linked plus adenosine deaminase SCID trials demonstrate that oncogenesis is not necessarily an inherent, inevitable side effect of gene therapy. In more than 20 patients, the genetic deficiencies of more than 80% have been fully corrected, allowing them to lead normal lives. However, tumors and leukemias can take years to manifest, and these trials are in their early years. A clearer understanding of the variables that underlie oncogenesis is needed in order to increase the safety of these trials. These variables include insertion site preferences of therapeutic vectors, their abilities to activate nearby genes, and interactions between specific genetic cargos and activated host genes. Although cargo-host interactions will be specific to each gene therapy approach, the vectors themselves govern other parameters of insertion preference and neighboring gene activation. Analyses of insertion preferences, in particular, have received much recent attention, and have sparked interest in the use of transposons as alternatives to viruses as gene therapy vectors.

Nonviral vectors for introduction of genetic cassettes into mammalian genomes

Transposable elements also have been used for insertional mutagenesis and genetic studies in model organisms, and are being developed as gene therapy agents in humans [50–53]. The best characterized DNA transposon vector used in mammals is the synthetic Sleeping Beauty (SB) transposon system [54], which over the past decade has become a powerful tool in functional genomics to identify genes in vertebrates, including fish and mammals [55–61]. Application of transposon-mediated gene transfer to gene therapy has been explored because it avoids several disadvantages of viral delivery systems. These disadvantages of viruses include the following: (1) their preference for integrating into genes [62–65]; (2) the difficulty with purification to eliminate toxic or infectious agents [66]; (3) their potential to elicit unwanted immune or inflammatory responses [67, 68]; (4) the constraint on therapeutic cargo size; and (5) the difficulty and expense associated with their production in large quantities [69, 70]. In contrast to viral vectors, preparations of nonviral plasmid-based transposon vectors are relatively inexpensive to purify, are largely nonimmunogenic, and have no hard constraints on genetic sequences that can be delivered.

A negative tradeoff with DNA vectors is increased difficulty in delivery. Delivery of nonviral DNA into mammalian genomes involves avoiding or traversing numerous barriers, including enzymes in the blood and cellular environments, the endothelial lining of vessel walls, cellular plasma membranes, endosomal membranes, nuclear membranes, and chromosomal integrity [71].

There are three delivery approaches that work across the nanoscale, microscale, and macroscale [72]. Nanoscale delivery involves particles or complexes that are most often designed to be about 100 nm or less in diameter, although sizes up to 1 μm fit into this category. The nanoscale approach comprises delivery of single or small numbers of DNA molecules, which most often are collapsed by polycationic polymers (for example, polylysine and other modified amino acids, and various linear and branched forms of polyethylenimine, among others) or lipids, with or without various ligands (for review, see the report by Wagner and coworkers [71]). Some polycationic complexes are cytotoxic or unstable in the blood, which can be circumvented by encasing the complexes in polyethylene glycol [73]. Alternative delivery routes are those at the microscale and macroscale, in which DNA in packages up to 10 μm are phagocytized (microscale) or enter cells via fusions with other cells or entities larger than 10 μm (macroscale).

In mice, the most effective method for in vivo gene transfer and expression has been demonstrated in hepatocytes using simple infusion of naked plasmid DNA under increased pressure. This can be accomplished by hydrodynamic delivery of DNA using high pressure/high volume injection [74, 75]. In mouse, this procedure involves injection of a large volume (10% volume/weight) of DNA/saline solution through the tail vein in less than 10 seconds. This procedure results in uptake of infused DNA into as many as 10% of hepatocytes in test animals [74, 75] by expanding and rupturing liver endothelium, which in mice heals within 24 to 48 hours [76]. Achieving a clinically feasible method of local delivery to liver in large animals, including humans, is a challenge that is being addressed by more localized hydrodynamic delivery using specialized catheters or pressure cuffs [77, 78]. On the microscale, condensing DNA with polyamines such as polyethylenimine to a complex small enough to be taken up by cells into endosomes has been studied intensively [79, 80]. Our findings (Hackett PB, Podetz-Pedersen K, Bell JB, McIvor RS, unpublished data) suggest that gene expression following hydrodynamic delivery is about 100-fold more effective than delivery using polyethylenimine [81, 82] and only about 10-fold to 100-fold less effective than viral delivery to liver [72]. Alternative delivery ex vivo using electroporation is under development and has been achieved in hematopoietic stem cells [83].

Since the development of the SB system, nonviral, integrating DNAs have established themselves as potential vectors for gene therapy. Following hydrodynamic delivery, transposons have been used in mice to cure hemophilias A and B [84–87] and tyrosinemia type I [88, 89]. Other somatic delivery methods were used to ameliorate blistering skin disease (junctional epidermolysis bullosa) [90], retard glioma xenografts [91, 92], produce Huntingtin protein in a model of Huntington disease [93], and as a preventive treatment for lung allograft fibrosis [94]. Based on the findings summarized above, we estimate that only about one in 10,000 SB transposons that are delivered to liver or lung actually transpose into chromatin (Hackett PB, unpublished data). Although this is a small fraction, it is possible to deliver more than 10^8 therapeutic cassettes to an animal in order to treat as many as 10% to 20% of liver cells with a single injection of plasmids [84, 88, 95]. This procedure is sufficient to cure diseases such as hemophilia and tyrosinemia type I, and to ameliorate other diseases such as mucopolysaccharidoses types I and VII. Although the number of transposon insertions per cell has not been quantified because of the difficulty of cloning insertion sites in mostly nondividing cells in most organs of animals, the expression data are consistent with a single integration in most if not all transgene-expressing cells.

In addition to SB, several other transposon vectors and phage integrase-based vectors have been tested for their potential to deliver therapeutic genes, including Frog Prince [96], Tol2 [89], and piggyBac [97], as well as other well characterized transposons such as the Drosophila P-elements, which are not mobilized very efficiently in mammalian cells [98]. These vectors differ in their efficiency of gene insertion, genetic cargo capacity, integration site preferences, and effects on chromosomal stability. Among other advantages these systems have over retroviruses as gene therapy vectors, transposons present a wide variety of insertion site preferences that differ from those of retroviruses, with possible consequences for oncogene activation. The characteristics of these vectors are summarized in Table 1. The remainder of this review discusses these differences as they relate to gene therapy and functional genomics.

Table 1 Properties of nonviral integrating vectors proposed for gene therapy

Factors governing insertion site preferences and their variation among vectors

Although most vectors will integrate into a vast number of sites scattered throughout the genome, numerous studies have shown that these integrations are not random with respect to several variables. Global preferences for vector integration can be governed by large-scale genomic context such as coding and regulatory regions of genes, and their transcriptional status, as compared with intergenic regions [99]. The fine tuning that determines specific sites of integration is governed by smaller scale, physical features, such as the specific sequences of nucleotides surrounding insertion sites and DNA structural characteristics derived from these sequences. Figure 2 illustrates some of the physical features of DNA that are influenced by local sequence.

Figure 2

Deviations of DNA structure from the average B-form DNA that play a role in modeling three-dimensional structures of specific DNA sequences. The figure illustrates physical parameters of B-form DNA structure that are altered in preferred sites for integration of insertional vectors. (a) B-form DNA. (b) A-DNA. Interactions between neighboring nucleotides govern the variable energy needed to convert from B-DNA to A-DNA. The propensity of a sequence of B-form DNA to adopt the A-form is referred to as A-philicity [134]. (c) Parameters of base pair orientation affected by protein-DNA binding. 'Twist' (horizontal looping arrow) refers to the rotation of base pairs around a central axis (heavy vertical black line); the average rotation between two base pairs is 36°. 'Tilt' (dotted lines) refers to the inclination of the base pairs with respect to the central axis; the average tilt is 0° between base pairs, which are normally parallel in B-form DNA. 'Rise' (vertical double arrowhead) is the distance between adjacent base pairs; the normal spacing is slightly more than 3.3 Å, but it can be more than 3.4 Å at preferred target sites. 'Slide' (horizontal double arrowhead) refers to the shifting of the axis of a base pair out of alignment with the central axis. 'Roll' (vertical looping arrow) refers to rotation of the nucleotide plane around a horizontal axis. A given base pair may be distorted in more than one of these parameters. V step analysis is a method of examining these, and other physical parameters such as 'shift', in terms of a single number that derives from the transition from one base pair to another [131, 137]. (d) DNA bendability

Viruses and transposons exhibit a wide range of variability with respect to preference for genes and transcriptional units. Several studies have mapped hundreds to thousands of insertions into human or mouse genomes, and correlated insertion positions with known genes. Many retroviruses exhibit a nonrandom preference for genes [65]. This could be due to greater accessibility of the DNA in 'open' chromatin or interaction of integrase enzymes with cellular factors bound to transcriptional regulatory elements. In the case of HIV, the LEDGF/p75 transcriptional factor may act as a tether between the integrase and transcriptionally activated chromatin [100–102], which is similar to an idea that was proposed previously for designer targeting of integrating vectors [103–105]. In a similar approach using the SB transposon, Yant and coworkers [106] found that SB exhibited a much lower (although nonrandom) preference for genes. Although a preference for transcriptional units might seem beneficial for functional genomics studies, the myriad recently identified noncoding RNA genes [107] (as well as other RNA product genes such as those encoding rRNA and tRNAs) involved in gene regulation may not be targeted by viral vectors that preferentially integrate into or near protein-encoding genes. Targeting of various vectors to these noncoding RNAs in gene therapy, and any resulting deleterious effects, has not been extensively examined.

Many vectors appear to exhibit a preference for specific genes. In insertional mutagenesis studies, the identification of recurrent viral insertions into a specific group of genes was taken to mean that viral activation of these putative oncogenes in individual cells led to clonal expansion among a pool of cells in which every host gene was an equal target for integration (as discussed above for LMO2). However, when MLV insertions were mapped in normal HeLa cells that did not undergo any type of selection, oncogenic or otherwise, many of these same genes harbored recurrent integrations, suggesting that vectors may inherently target specific genes [48]. The basis of this selection is not understood, but it may be similar to that discussed above for HIV.

In addition to general preferences for genes, many viral vectors, including retroviruses, lentiviruses, and adeno-associated virus, preferentially target transcriptional units or their promoters. MLV retroviruses have a preference for integration proximal to transcriptional initiation sites [64, 65, 108–111], which is a problematic trait, considering that MLV-based vectors are the most commonly used vectors in human gene therapy [4]. HIV and adeno-associated viruses have preferences for entire transcriptional units [100, 108, 111–113] (see Note added in proof, below); this is in contrast to MLV, which targets only the region proximal to promoters. Additionally, expression array studies have shown that HIV has a preference for transcriptionally active genes [65] as well as an avoidance of chromatin regions in which transcription is repressed [114].

In contrast to these viral vectors, SB transposons and avian leukosis virus (a retrovirus) apparently have only a slight preference for either transcriptional units or their regulatory elements [106, 115], with little or no preference for transcriptionally active genes [65]. In one survey, SB exhibited an overall preference for microsatellite repeats, found primarily in noncoding regions [106], possibly due to the preferred target sites found in TA repeats [116]. A study that correlated insertion sites with hundreds of genome annotations [99] illustrated the degree to which genomic features and primary sequence influenced integration preferences for several vectors (for example, the L1 and SB transposon insertions were much more influenced by primary sequence than were retroviral vectors). This study also found variable preferences between vectors for elements such as CpG islands, DNase I sensitive sites, and transcription factor binding sites. The recent identification of a periodic sequence encoding nucleosome positioning [117] may also correlate with vector integration patterns, because nucleosomes have been shown to affect patterns of retroviral integration [118]. Similar studies to identify trends for piggyBac and Tol2 with respect to genome-wide integration preferences will be valuable in assessing the relative safety of these vectors for gene therapy.

Local insertional preferences: DNA sequence and structure

Although many vectors exhibit a preference for genes, and even specific genes, few vectors repeatedly integrate into the same precise position with any significant frequency. Rather, most genes harboring frequent insertions show a distribution of insertions into several positions within the same gene. Some vector integrases, such as those of phages φC31 [119–121] and φBT1 [122], as well as the Escherichia coli Tn7 transposon [123], recognize specific DNA sequences or degenerate sequences that exist in mammalian genomes. SB integrates specifically at a TA dinucleotide, and the piggyBac transposon integrates into the sequence TTAA. Because the oncogenic potential of a vector is related to its propensity to integrate in or near a select few genes, understanding local parameters that affect integration may contribute to our ability to assess the risk associated with these vectors in gene therapy.
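The sequence requirements stated above (TA for SB, TTAA for piggyBac) define the pool of candidate target sites in any stretch of DNA. The following is a minimal Python sketch of how such candidate sites could be enumerated; the function name and the example sequence are illustrative assumptions and are not taken from any of the cited studies.

```python
import re


def find_target_sites(sequence, motif):
    """Return 0-based start positions of every occurrence of motif.

    A zero-width lookahead is used so that overlapping occurrences are
    all reported (for example, TA at positions 0 and 2 in "TATA").
    """
    seq = sequence.upper()
    pattern = f"(?={re.escape(motif.upper())})"
    return [match.start() for match in re.finditer(pattern, seq)]


# Hypothetical 18-base example sequence, chosen only for illustration.
example = "GCTATATTAACGTTAATA"
ta_sites = find_target_sites(example, "TA")      # candidate SB sites
ttaa_sites = find_target_sites(example, "TTAA")  # candidate piggyBac sites
print("TA sites:", ta_sites)
print("TTAA sites:", ttaa_sites)
```

Such a list gives only the sequence-permissible positions; as discussed in the text, which of these positions are actually favored depends on additional local properties.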

For retroviruses and the SB transposon, consensus sequences have been described surrounding the sites of integration [111, 124–127]. Although retroviruses do not exhibit a strong consensus sequence, the nonrandom pattern of integrations and the observation that frequently hit sites did not match the consensus sequences led investigators to examine other properties of DNA sequences surrounding target sites, including structural characteristics of the DNA itself. DNA structural characteristics are based on non-Watson and Crick interactions between nucleotides and encompass deformations to the regular double helix structure caused by interactions between adjacent, planar bases (Figure 2). Originally characterized from analysis of crystal structures of DNA bound to histones and other proteins, these characteristics include 'protein-induced DNA deformability', 'A-philicity', and trinucleotide 'bendability'. These properties underlie local variations in DNA structure that are probably relevant to recognition of DNA by transposases and integrases. Early investigations into insertion preferences showed that viruses preferred 'bent' DNA [118, 128, 129], and several groups have investigated secondary DNA structural patterns in sequences that flank mapped insertion sites for both transposons [115, 124, 130, 131] and retroviruses [111, 126] to determine general characteristics of the flanking sequence of 'preferred' integration sites. Similarly, the RAG1/2 protein complex, which has properties akin to the cut-and-paste transposases, recognizes a specific sequence/structure for recombination of antigen receptor genes [132].

Different DNA sequences may produce highly similar patterns of DNA secondary structure, and thus common structural patterns that are preferred for integration may be obscured by approaches that analyze sequence alone. Analysis of secondary structure for a DNA sequence is based on translation of a sliding window of two or three bases into structural values for each 'step'. For example, the tendency of a B-form helix to adopt the A-form (A-philicity; Figure 2) can be predicted by translating each consecutive (overlapping) dinucleotide into one of 10 A-philicity values for the 16 combinations of base pair transitions [133–135]. Similarly, protein-induced deformability encompasses several changes in base pair orientation from a 'perfect B-form double helix' in a transition between two consecutive base pairs (Figure 2c). All of these changes can be expressed as a single composite parameter of protein-induced DNA deformability known as V step [136–138]. V step represents the physical relationships of any two planar base pairs in terms of their relative shifts and angular orientation. In contrast to A-philicity and protein-induced deformability, DNA bendability is best modeled using a sliding window of three bases, with 64 possible trinucleotide bendability values [139].
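The sliding-window translation described above can be sketched in a few lines of Python. The sketch assumes that the reduction from 16 dinucleotides to 10 values arises because a step and its reverse complement describe the same double-stranded step, which is the usual convention for dinucleotide step parameters. The lookup table below holds arbitrary placeholder numbers, not published A-philicity or V step values, and all function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_step(step):
    """Map a step and its reverse complement onto one shared key."""
    return min(step, reverse_complement(step))


def sliding_steps(sequence, window=2):
    """Split a sequence into consecutive, overlapping windows."""
    seq = sequence.upper()
    return [seq[i:i + window] for i in range(len(seq) - window + 1)]


def structural_profile(sequence, table, window=2, key=canonical_step):
    """Translate each window into a structural value via a lookup table.

    Pass key=str for a table that lists every window explicitly
    (for example, one keyed by all 64 trinucleotides).
    """
    return [table[key(step)] for step in sliding_steps(sequence, window)]


# The 16 dinucleotides collapse to 10 unique double-stranded steps.
ALL_DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]
UNIQUE_STEPS = sorted({canonical_step(d) for d in ALL_DINUCLEOTIDES})

# PLACEHOLDER values only: arbitrary numbers standing in for a real table.
PLACEHOLDER_TABLE = {step: float(i) for i, step in enumerate(UNIQUE_STEPS)}

profile = structural_profile("ACGTTA", PLACEHOLDER_TABLE)
print(len(UNIQUE_STEPS), profile)
```

A sequence of n bases yields n - 1 dinucleotide values or n - 2 trinucleotide values, so profiles from different parameters are offset slightly relative to the underlying bases.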

An example of DNA structural analysis for the Tol2 transposon is shown in Figure 3, in which average structural values for each position flanking an insertion site are plotted and compared with a plot of random sequences. In the case of Tol2, weak preferences in V step and A-philicity values at specific coordinates are apparent by the peaks in the heavy black lines in Figure 3a,b (left sides), in contrast to the same averages derived from random sequences (right sides). Overall, the bendability around Tol2 insertion sites exhibits little deviation from a random sequence (Figure 3c), unlike those preferred by SB transposase (Figure 3d). Analysis of hundreds of integration sites for potential gene therapy vectors, including viruses as well as transposons, shows that many have subtle preferences for these variables (Figure 4). For example, the piggyBac transposon may favor sites with slightly higher A-philicity, lower bendability, and lower V step values than random sequences. In contrast, 'preferred' SB insertion sites (see below) clearly display a jagged V step pattern and higher bendability. Interestingly, although retroviruses (avian sarcoma virus [ASV], HIV, MLV, and simian immunodeficiency virus) integrate into bent DNA [128], such as that bound to nucleosomes, our analyses of sequences around viral insertion sites do not indicate a particular preference for bendable DNA (Figure 4). A similar, more rigorous approach has been utilized to characterize Drosophila P-elements [130] and non-LTR retrotransposons in Entamoeba histolytica [140], demonstrating that DNA structural characteristics at insertion sites for both elements are significantly different from collections of random sequences.

Figure 3

Approaches to identification of DNA structural characteristics governing insertion site preferences for Tol2 and SB transposons. (a) Averaging of all available insertion sites smoothes trends observed in individual plots. Plot of V step profiles of 18 20-base-pair Tol2 insertions (left, from Balciunas and coworkers [89]) compared with 18 randomly generated sequences (right). Averages are shown by thick black lines. Although individual Tol2 profiles appear jagged, peaks are not position specific, and so the plot of the average of 36 sites reveals only one small, distinct peak. Individual random sequences also appear jagged, but an average of over 9,000 random sequences is a flat line. (b) Analyses of Tol2 insertion site A-philicity profiles, compared with 18 random sequences. Trends are similar to V step patterns. (c) Plot of trinucleotide bendability for Tol2 and random sites, indicating only small common trends compared with random sequence. The random sequences in panels a to c were acquired from a 10 megabase portion of human chromosome 1p. (d) Bendability plots for Sleeping Beauty (SB) insertion sites (from Yant and coworkers [106]). The average trinucleotide bendability at each position of 12-base insertion sites is shown for 574 insertions ('all sites'), as well as a subset of 189 insertions classified as 'preferred' based on V step profiles ('preferred sites'). Random TA sites are shown in green, and random sites in black. This plot shows how identification of 'preferred' sites can be useful in distinguishing structural patterns for common insertion sites; preferred sites (based on common patterns of protein-induced deformability in recurrently hit sites) exhibit an overall increase in a separate parameter, DNA bendability, when 'basal' sites are removed.

Figure 4

Variability in DNA structural characteristics between insertion sites for various vectors. All (a) A-philicity, (b) trinucleotide bendability, and (c) V step values were summed across 12 nucleotides and averaged for all sites of each vector class. (d) 'Jaggedness' was measured by taking the absolute value of differences between adjacent V step values, which were then summed and averaged, as in panels a to c. Error bars represent standard deviations. 'SB' indicates 574 Sleeping Beauty integrations into human cells identified by Yant and coworkers [106]. 'SB preferred' indicates a subset of 189 sites from the Yant dataset classified as 'preferred' by ProTIS [116]. 'tol2' indicates 63 Tol2 integrations [89]. 'piggyBac' indicates 297 piggyBac insertions deposited into GenBank by Exelixis containing a single TTAA sequence flanked by 10 bases on each side. 'P-element' indicates 920 P-element insertion sites mapped by Liao and coworkers [130]. 'ASV' indicates 357 avian sarcoma leukosis virus (ASLV) insertions into 293T-TVA cells. 'HIV' indicates 334 HIV integrations into SupT1 cells. 'MLV' indicates 695 murine leukemia virus integrations into HeLa cells. 'SIV' indicates 148 simian immunodeficiency virus integrations into CEMx174 cells. All P-element, ASV, HIV, MLV, and SIV sequences were kindly provided by Dr Xiaolin Wu. All sites were compared with three sets of over 9,000 randomly selected 12-mers from 10 megabase sections of human chromosome 1 (Hs), mouse chromosome 4 (Mm), and Drosophila chromosome 3L (Dm), and 10,000 randomly selected TA and TTAA sites from human chromosome 1.

For SB, the observation of general structural trends surrounding insertion sites eventually led to the identification of a specific DNA structural pattern governing insertion preference. Vigdal and coworkers [124] observed that increased DNA deformability and A-philicity were features of a consensus sequence that flanked SB TA insertion sites. Subsequently, Liu and colleagues [131] mapped about 200 integrations into a relatively small 7 kilobase plasmid sequence and observed that some common integration sites did not share the consensus sequence. These results identified several 'preferred' TA dinucleotides that harbored recurrent integrations. These preferred integration sites exhibited a striking, specific pattern of alternating high and low deformability (V step) values that was absent in TA sites that were rarely, if ever, used. This led to the conclusion that SB transposase prefers a 'zigzag' V step pattern of DNA deformability [131], which was later confirmed on a larger, genomic scale [115]. It remains unknown whether these patterns influence the recognition and binding of the SB transposase, catalysis of the transposon integration, or some other mechanistic factor.

This analysis was repeated for other vectors, including piggyBac, P-elements, and several retroviruses [115]. However, only weak structural signatures were detected, which were no more informative than the weak consensus sequences previously identified. A key difference in the SB screen was the level of saturation of a small target, which allowed highly preferred sites to be distinguished from nonpreferred TA dinucleotides. In contrast, the datasets for the other vectors were derived from a relatively small number of insertions into mammalian genomes, which were insufficient to obtain an initial set of preferred sequences. Because nonpreferred sites are likely to vastly outnumber preferred sites in the genome for most vectors, any genome-wide screen will produce a mixture of preferred and nonpreferred sites that cannot be told apart. For example, we have estimated that of the approximately 200,000,000 TA sites in a human genome, only about 10% fall into the preferred category [115], although in the screen conducted by Yant and coworkers [106] 189 out of 573 (33%) genomic SB insertions were classified as preferred sites. Analysis of the bendability of all SB sites mapped in the screen reported by Yant and coworkers shows a peak at the center of the insertion site that is defined by the central TA dinucleotide. However, when only the preferred sites are analyzed, the surrounding nucleotides exhibit a much greater level of bendability (Figure 3d). This effect is seen even though the preferred sites were identified based on protein-induced deformability, V step, which is distinct from DNA bendability. The lesson from these studies is that most genome-wide datasets (particularly from experiments involving some form of genetic selection) will probably show a similar dilution of preferred sites by greater numbers of nonpreferred sites.
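The dilution argument can be made concrete with the figures quoted above. The following is a minimal sketch of that arithmetic only; the variable names are illustrative, and the 10% genome-wide estimate is itself approximate.

```python
# Figures quoted in the text; all names here are illustrative.
total_ta_sites = 200_000_000      # approximate TA sites in a human genome
preferred_fraction_genome = 0.10  # about 10% estimated to be preferred
preferred_hits = 189              # insertions classified as preferred in the screen
total_hits = 573                  # genomic SB insertions in the screen

preferred_fraction_screen = preferred_hits / total_hits
enrichment = preferred_fraction_screen / preferred_fraction_genome
nonpreferred_hits = total_hits - preferred_hits

print(f"preferred fraction in screen: {preferred_fraction_screen:.2f}")
print(f"enrichment over genome-wide estimate: {enrichment:.1f}-fold")
print(f"nonpreferred insertions in screen: {nonpreferred_hits}")
```

On these numbers, preferred sites are over-represented roughly threefold, yet about two-thirds of the mapped insertions still fall in nonpreferred sites, which is the dilution described above.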

There is a caveat to the analyses discussed up to this point: they all assume that the structures around integration sites have an absolute center of reference, defined by the site into which the vector integrated. Such analyses could miss structural patterns that are not strictly position specific. For instance, an integrase may have a preference for a local region that is highly bendable or deformable without requiring a particular pattern (or sequence). To account for this, we have examined a parameter called 'jaggedness', which we define as the degree to which V step values alternate from high to low, as in the preferred 'zigzag' sites for SB. We calculated jaggedness by summing the absolute values of the differences between adjacent V step values across a sequence, so that a jagged/zigzag site has a higher total value than a flat, basal site, which should have a jaggedness value close to 0. Jaggedness values for several vectors are shown in Figure 4. Although jaggedness values at insertion sites are similar to V step values for most vectors (with the possible exception of Tol2), the jaggedness patterns show a high degree of variability across genomic sequences and are somewhat independent of V step patterns (for instance, the c-myc gene; Figure 5).
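The jaggedness calculation described above can be sketched in a few lines. This is an illustration of the definition, not the script used for the figures; the function name and the two V step profiles are hypothetical, and the values are arbitrary numbers rather than measured deformabilities.

```python
from typing import Sequence


def jaggedness(v_step: Sequence[float]) -> float:
    """Sum of absolute differences between adjacent V step values."""
    return sum(abs(b - a) for a, b in zip(v_step, v_step[1:]))


# Hypothetical profiles with the same summed V step (15.0 each).
flat_site = [3.0, 3.0, 3.0, 3.0, 3.0]
zigzag_site = [1.0, 5.0, 1.0, 5.0, 3.0]

print(sum(flat_site), jaggedness(flat_site))      # 15.0 0.0
print(sum(zigzag_site), jaggedness(zigzag_site))  # 15.0 14.0
```

The two toy profiles have identical total V step but very different jaggedness, which illustrates how the two measures can vary independently.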

Figure 5

Insertion prediction for transposon vectors surrounding the c-myc locus on mouse chromosome 15. A 3 kilobase sequence from the mouse c-myc locus (from 61,813,400 to 61,816,400 base pairs) harboring 37 retroviral insertions submitted to the Mouse Retrovirus Tagged Cancer Gene Database [155] is shown. The first exon and intron of c-myc are shown in orange; the upstream promoter sequence is shaded in yellow. (a) Retrovirus insertion frequency per 50 base pair (bp) segment. Panels (b) to (g) show DNA structural characteristics at 50 bp resolution. (b) Total V step for each bin across the region. (c) Total V step jaggedness. (d) Total A-philicity values. (e) Total trinucleotide bendability. (f) Number of TTAA sequences per 50 bp bin, representing the total number of possible piggyBac insertion sites. Notably, many regions harboring oncogene-selected retroviral insertions have few or no TTAA sequences, suggesting that the likelihood of a piggyBac insertion causing an oncogenic event may be lower than that for retroviruses. Arrow represents a potential 'hotspot' for integration, over 1 kilobase upstream of exon 1. (g) ProTIS prediction shows a similar, low incidence of preferred SB integration sites. Arrow indicates predicted hotspot for integration over 1 kilobase upstream of exon 1, and slightly upstream of the TTAA hotspot. SB, Sleeping Beauty.

Integration preference versus oncogenic selection

We see two uses for profiling the insertion site preferences for integrating vectors. First, in functional genomics screens, insertion profiles that emerge can be compared with expected profiles that are only structure based rather than genetics based. A striking example of this is evident in the oncogene screens conducted with the SB transposon [58, 59], which is illustrated in Figure 6 with respect to the Braf gene. Integration sites that emerged from the screen are shown across the entire locus (Figure 6b) and in a selected region comprising exons 10-13/introns 10-12 (Figure 6d), where most of the integrations were selected because of induced expression of a truncated gain-of-function kinase polypeptide. Panels a and c show insertion site preference scores across the region obtained using an automated script (ProTIS) that counts and scores preferred TA dinucleotide insertion sites based on V step values [115]. The results shown in Figure 6 make two strong points. The first is that the frequency of oncogenic insertions in a select region corresponds to that predicted on the basis of preference profiling (Figure 6c,d; specifically, microscale structure can be a good predictor of integration site preference). The second is that many predicted hotspots (Figure 6a,b) were not sites that lead to oncogenesis. The combination of these two observations enhances the biologic importance of the integrations into introns 11 and 12.

Figure 6

SB insertions across the mouse Braf gene. Thirty Sleeping Beauty (SB) insertions deposited in the Retroviral-Tagged Cancer Gene Database were mapped across the entire Braf transcript and 10 kilobases upstream (NCBI 36 build; note that Braf is transcribed right-to-left). Most oncogenic insertions occurred in introns 11 and 12 (formerly annotated as intron 9). (a) ProTIS profiling across the entire gene reveals predicted hotspots for SB integration, but (b) most actual integrations were found in a relatively low scoring region corresponding to introns 11 and 12. An enlarged view of this local 4.9 kilobase region demonstrates that (c) ProTIS scores closely match (d) patterns of actual transposon integration. bp, base pairs.

The second application of predicting profiles of vector insertions may be as part of a risk assessment program. Although current understanding of integration site preferences for most vectors is still inadequate to allow prediction of the probability of integration into specific genes, genome-wide integration datasets may suggest the likelihood that a vector will integrate within the general vicinity of a specific gene. Similarly, analysis of DNA structural characteristics may be used to assess the likelihood that each vector will integrate within specific regions of genes. For example, although Braf can act as a potent oncogene, the pattern of SB integrations into Braf suggests that integrations into a relatively small region of the gene (introns 11 and 12) are the most highly selected for oncogenesis, in spite of the presence of hotspots across the entire gene. Thus, the range of possible insertions that are capable of generating an oncogenic transcript, combined with the relative 'attractiveness' of the sequence across these regions, will dictate the chances of insertional activation.

An analysis of several structural characteristics is presented for the mouse c-myc gene (Figure 5), the human ortholog of which is activated in many cancers [141]. The figure highlights the 3 kilobase region encompassing the promoter that harbors the bulk of oncogenic retroviral integrations at this locus that have been deposited in the Retroviral-Tagged Cancer Gene Database (RTCGD [142]). The sequence was divided into 50 base pair (bp) bins, and the values for V step, A-philicity, jaggedness, and bendability were summed across each bin. Measured in 50 bp bins, these structural parameters are highly variable across the sequence and vary independently of each other. Actual oncogenic retroviral insertions observed in insertional mutagenesis screens and deposited into the RTCGD are shown for comparison in Figure 5a. The profiles indicate two features of transposons under consideration for gene therapy. First, the most likely sites for SB transposons to integrate (Figure 5g) are shifted away from the most commonly found activation sites, as revealed by retroviral integrations (Figure 5a). Second, the profile of TTAA sites, required by the piggyBac transposon (Figure 5f), is similar to that of the preferred SB sites, and further shows that some regions harboring retroviral integrations contain no TTAA sequences, making piggyBac insertions into these sites impossible. Thus, at first approximation, it would appear that the transposons are less likely to insert close to the c-myc promoter than are retroviral vectors. In support of this, c-myc is infrequently hit in SB-based insertional mutagenesis screens; to date, only one c-myc integration has been deposited into the RTCGD. In contrast, many retroviral insertions into c-myc have been mapped, although the number of deposited retroviral insertions is much higher than the number of deposited transposon insertions.
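The binning step used for the TTAA profile can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the function name is hypothetical, a motif is assigned to the bin in which it starts, and the 150 bp toy sequence stands in for the actual c-myc sequence, which is not reproduced here.

```python
from typing import List


def count_motif_per_bin(seq: str, motif: str = "TTAA", bin_size: int = 50) -> List[int]:
    """Count motif occurrences per consecutive bin, keyed by motif start position."""
    seq = seq.upper()
    n_bins = (len(seq) + bin_size - 1) // bin_size  # round up to cover a partial last bin
    counts = [0] * n_bins
    start = seq.find(motif)
    while start != -1:
        counts[start // bin_size] += 1
        start = seq.find(motif, start + 1)  # advance by 1 so overlapping hits are counted
    return counts


# Toy 150 bp sequence (three 50 bp bins); not real genomic sequence.
toy = "TTAA" + "G" * 46 + "C" * 50 + "ATTAATTAA" + "G" * 41
print(count_motif_per_bin(toy))  # [1, 0, 2]
```

A bin with a count of 0, like the middle bin here, corresponds to a stretch in which a piggyBac insertion is not possible because no TTAA target is present.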

The relative lack of SB insertions into c-myc may be due to either a paucity of favorable SB insertion sites in regions of the gene competent for oncogenic activation, or an overall lack of oncogenic selection for insertions into this gene. In support of the former, transposon-free amplification of c-myc was one of the few genomic aberrations observed in tumors harboring mobile transposons (Largaespada DA, Collier LC, Hackett CS, unpublished observations), suggesting that activation of c-myc plays a role in the biology of these tumors (there was probably oncogenic selection for the genomic amplicon). Similar ProTIS analysis of the LMO2 locus revealed that the most preferred integration sites for SB transposons were considerably farther from the LMO2 promoter than the mapped integrations of activating retroviruses [115]. That said, it is evident that prediction of vector integration is not precise and that even rare integrations into unfavorable sites have the potential to promote oncogenic expansion, as indicated in Figure 6.

Vector behavior in risk/outcome assessment: lessons from intentional oncogenic insertional mutagenesis

In spite of the inherent behavior of each integrating vector, existing evidence suggests that the oncogenic potential of any given vector can be attenuated depending on how it is used. As with retroviruses, the SB transposon has been used for functional genomics as well as for delivery of therapeutic genes in mouse models of inherited disease. These studies were motivated by two limitations of retroviruses for insertional mutagenesis: the restriction of viruses to infecting specific cell types and the tendency of many viral vectors to insert near and activate a possibly limited number of genes [48]. In two recent SB mutagenesis screens, a transgenic concatemer of T2/Onc transposons carried in the germlines of mice was remobilized in somatic cells by a trans-acting, transgenic SB transposase. The two screens differed in expression level, domains of expression, and activity of the SB transposase, as well as in the copy number of the transposon concatemers [58, 59]. An important finding from the two studies was that the oncogenic potential of the same T2/Onc transposon vector, which was engineered specifically to activate oncogenes and cause cancers in mice, ranged from no observable phenotype at one extreme to rapid development of severe cancer at birth at the other. The oncogenic effect was directly related to the number and types of cells at risk for transposon-induced mutations and perhaps to the remobilization rates. The same properties may be relevant for a wide range of other gene therapy vectors.

Coupled with the lack of a preference to integrate near genes, the chances that an SB insertion of a therapeutic gene (in contrast to a genetic cassette designed to wreak havoc on transcriptional units) will activate a neighboring host gene would seem to be lower than for vectors that have an affinity to integrate into genes [65, 97]. This feature may be a disadvantage for SB-based functional genomics studies aimed at mutating genes, but it may be advantageous for gene therapy.

Engineering safer vectors

As an alternative to finding vectors that do not target genes, several groups are attempting to target vector integration to a specific region of the genome by generating integrase and SB transposase molecules that are fused to DNA-binding domains that recognize specific DNA sequences [143, 144]. It appears that targeting introduces a reduction in activity, without much increase in specificity of integration into specific sites in a mammalian genome [144, 145]. This is not surprising if the ability of SB transposase to integrate promiscuously into TA sites is not curtailed. There are about 2 × 10^8 potential TA-dinucleotide sites into which SB transposons can integrate, of which it is estimated that 2 × 10^7 are preferred integration sites [115]. Consequently, the chances of a sequence-specific targeting motif added to SB transposase actually guiding transposition to a specific, low-copy target sequence are expected to be extremely low compared with the chances of integrating into any of the millions of other available TA sites. Similarly, to overcome the risk for activation of neighboring genes following vector integration, self-inactivating vectors are being engineered to have diminished ability to activate genes over long distances [146, 147], although it is not clear whether these vectors will be safer [148]. The φC31 phage integrase system targets relatively few sites in mammalian genomes [119, 149], but it appears to introduce a relatively high level of chromosomal recombination [149-151]. Thus, further development of safer vectors remains an open area of investigation.
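The scale of the targeting problem can be illustrated with a deliberately simple toy model. It assumes, purely for illustration, that every nontarget TA site is equally attractive and that a targeting domain makes one chosen site some fold more attractive than each of the others; the function name and the fold values are hypothetical and are not taken from the cited studies.

```python
def target_fraction(n_sites: float, fold_preference: float) -> float:
    """Expected fraction of integrations at one targeted site, when that site is
    `fold_preference` times more attractive than each of the other sites."""
    return fold_preference / (n_sites - 1 + fold_preference)


n_ta_sites = 2e8  # approximate number of potential TA sites quoted in the text

# Hypothetical fold preferences conferred by a targeting domain.
for fold in (1, 1_000, 1_000_000):
    print(f"{fold:>9,}-fold preference: {target_fraction(n_ta_sites, fold):.2e}")
```

Under these assumptions, even a million-fold preference for a single site would place fewer than 1% of integrations at the target, because the remaining TA sites collectively still dominate.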

Conclusion

Ultimately, functional genomics and gene therapy would like to answer the same question for any given vector (while hoping for opposite outcomes): what are the chances of activating genes? There are four major factors influencing the answer, with each retroviral and transposon vector having different characteristics for each factor. First, what is the overall tendency of the vector to integrate into genes or promoters? Second, are there adequate local target sites around genes of interest to attract the vector? Third, over what distance can the vector activate a gene? Fourth, to what extent can the integration activity be modulated to control the overall likelihood of hitting specific insertion sites close enough for activation of specific genes? Theoretically, knowing each of these variables for every vector would allow researchers to choose the vector with the most utility and lowest risk for the specific purpose intended. In gene therapy, these parameters translate into the risk for hitting a specific oncogene or tumor suppressor gene that could lead to a severe adverse effect. If, in the future, hotspots for integration of SB and other potential gene therapy vectors can be predicted, then we should be able to assess more accurately and modify the various risks for adverse effects from therapeutic vectors. This goal should be within reach in the coming years.

Note added in proof

Since submission of the manuscript, adeno-associated viral vectors (AAV) have been implicated in the induction of hepatocellular carcinomas in mice [152] and in the death of a patient in a clinical trial for treatment of rheumatoid arthritis [153].