Introduction

The oncogenic transformation of a cell requires several mutations conferring abnormal properties that may include independence from the requirement for growth stimulatory signals, loss of sensitivity to growth inhibitory signals, the acquisition of unlimited proliferative capacity, suppression of apoptotic signals, and the successful evasion of immune responses (Dunn et al. 2002; Hanahan and Weinberg 2000). A common feature of human leukemias and lymphomas is nonrandom, somatically acquired chromosomal translocations and inversions that are often diagnostic criteria of specific subtypes of malignancies. It has been estimated that 65% of all cases of human acute leukemias involve chromosomal abnormalities (Look 1997). Oncogenic translocations frequently act by upregulation of proto-oncogenes or inactivation of tumor suppressor genes (TSGs).

In the mouse, loci involved in the onset and progression of hematologic malignancies have been identified primarily by the isolation of common proviral insertion sites (CISs). Slow-transforming retroviruses cause a subset of reinfected cells to become oncogenic by altering the expression of genes near their insertion sites in a manner reminiscent of translocations in human malignancy. Aberrant gene expression can be caused by placing the affected locus under the direct control of viral long terminal repeat (LTR) promoters, long-range enhancer activity of the LTRs, disruption of the affected locus by proviral insertion within coding sequences, or local alterations of chromatin. The rare cases in which an insertion affects the expression of a proto-oncogene or TSG can result in the neoplastic transformation and clonal expansion of the affected cell. Because proviral insertions are near the loci whose expression they alter, these insertions can be used as a molecular tag to isolate and identify affected loci. Identification of these loci can be important for understanding the pathogenesis of human neoplasms because genes proximal to viral insertion sites in mouse tumors are often deregulated in human blood cancers, highlighting the applicability of slow-transforming retroviral mutagenesis as a model for human disease (Erkeland et al. 2006). Systematic surveys of CISs found in well-established mouse models of lymphoid and myeloid malignancies have implicated a wide range of genes involved in a diversity of cellular functions as potential oncogenes and TSGs. As with chromosomal translocations in human leukemia and lymphomas, many genes implicated in tumor incidence are known to be key regulators of normal hematopoiesis or have hematopoietic lineage-specific expression patterns (Suzuki et al. 2002). 
In addition, insertion sites that are commonly found in a particular class of neoplasia may identify loci whose altered expression is a common event in the development of that cancer subtype (reviewed in Jonkers and Berns 1996). The AKXD mouse strains chosen for the present study were previously found to develop a high incidence of B-cell lymphomas allowing us to focus our efforts on understanding the molecular basis underlying the pathogenesis of the most common human hematopoietic neoplasms (http://www.seer.cancer.gov).

The AKXD RI strains were generated from crosses of the T-cell lymphoma-prone AKR/J strain of mice with the DBA/2J strain of mice to generate 21 highly lymphomatous strains that vary in tumor incidence and tumor subtype. High rates of lymphoma correlated with the presence of transposable MuLV, indicating that MuLV is the primary mutagen in this model. Seven strains develop primarily B-cell neoplasms, whereas six develop T-cell, seven mixed B- and T-cell, and one primarily myeloid neoplasia (Gilbert et al. 1993). Many other mouse strains used for studying the molecular genetics of tumor development develop primarily myeloid disease (Eriksson et al. 1999; Jonkers et al. 1997; Liao et al. 1997; Moskow et al. 1995; Nakamura et al. 1996; Shaughnessy et al. 1999; Szymanska et al. 1999; Valk et al. 1997; van Lohuizen and Berns 1990; for review see Voisin et al. 2006), with the exception of the NFS.V+ strain, which also develops predominantly B-cell lineage neoplasms (Hartley et al. 2000). Each of these strains has proven to be very useful for identifying potential oncogenes and TSGs involved in the development of lymphoid and myeloid malignancies.

In this article we describe a new tumor subset from AKXD mice and perform a complete genomic analysis of the insertion sites. The use of our novel viral insertion site amplification (VISA) technique permits the rapid sequence-based identification of loci that potentially play a role in the development of these tumors (Hansen et al. 2000). The VISA approach is based on a technique called restriction-site PCR (Sarkar et al. 1993), which allows priming from a single site flanking the provirus with no prior knowledge of the genomic sequence and does not require a subcloning step, reducing false-positive detection. Here we have isolated 550 VISA sequence tags (VSTs) from screening tumors that developed in 204 mice from nine AKXD strains. These tags identify 74 CISs, of which 21 have not previously been identified, and represent new candidate proto-oncogenes and TSGs for lymphoid tumors.

Materials and methods

AKXD mice

Animals from nine AKXD RI strains of mice that develop either predominantly B-cell or mixed T/B-cell lymphomas were obtained from The Jackson Laboratory (Bar Harbor, ME), aged to 18 months, and monitored for signs of illness. Moribund animals or normal animals at the age of 18 months were autopsied and evaluated. Both normal and tumor tissues from a single lymphomatous animal were snap-frozen in liquid nitrogen.

Classification of tumors

A portion of each tumor was fixed in 10% formalin and embedded in paraffin. These fixed tumors were analyzed for histopathology as previously described (Morse et al. 2001) at the Laboratory of Immunopathology at the National Institute of Allergy and Infectious Diseases at the National Institutes of Health (Bethesda, MD).

Genomic DNA was extracted from both normal and tumor tissues as described (Wu et al. 1995) followed by a single phenol/chloroform extraction and ethanol precipitation (Sambrook et al. 1989). Restriction enzyme-digested genomic DNAs from tumors were electrophoresed and transblotted (Sambrook et al. 1989). Membranes were prehybridized and hybridized as described (Church and Gilbert 1984). Probes for IgH, IgK, JB1, and JB2 were described previously (Kronenberg et al. 1985; Mucenski et al. 1986). Tumors were classified as B, T, or mixed lineage based on BCR (IgH and IgK) or TCR (JB1 or JB2) gene rearrangements. In cases where histopathology and genomic rearrangements were contradictory, histopathology data took precedence in classifying the tumor, but all information is deposited in the database.

VST identification

The VISA procedure was performed as described previously (Hansen et al. 2000). Briefly, the amplification of mouse genomic DNA flanking proviral insertion elements was performed in a two-step PCR. The first round was performed using a degenerate, restriction-site-specific primer containing the M13 sequence and a nested primer (SP1) specific to the ecotropic MuLV LTR. The second round was performed using 1 μl of first-round PCR product as template, a nested LTR-specific primer, and the M13F primer. The products of second-round PCR were analyzed by gel electrophoresis and the tumor-specific VISA products were excised (reducing endogenous virus detection), purified, and sequenced directly. DNA isolated from brain tissue, which is rarely infiltrated by tumor cells, was used as a control to identify all nontarget amplification products such as endogenous proviral insertion sites, internal proviral sequences, or single primer amplification products generated by the restriction-site-anchored primer. Purified DNAs were amplified using the ABI Prism BigDye Terminator Cycle Sequencing Kit (Applied Biosystems, Foster City, CA) and electrophoresed on an ABI Model 377 DNA sequencer (Applied Biosystems). Sequences were analyzed using Sequencher software (Gene Codes Corp., Ann Arbor, MI). Sequences were edited to remove ambiguous bases and examined for sequence overlap. As a control for specificity, each sequence from the second-round PCR was analyzed to confirm the presence of the target restriction enzyme (RE) recognition site and viral LTR sequence. When a single insertion mutation was amplified multiple times using different RE primers from a single tumor, only the longest sequence was designated as a VISA sequence tag (VST) and was used for further analysis. 
Repeat-masked VST sequences were compared using BLASTn (ftp://ftp.ncbi.nih.gov/blast/executables/) against the mouse genome assembly (Ensembl release 40 from NCBI assembly m36), which was downloaded from the Ensembl genome annotation website (ftp://ftp.ensembl.org/pub/current_mus_musculus/data/fasta/dna/). The BLAST results were visualized by an in-house-developed program (VST Visualization) to confirm that the highest-ranking alignment was the only highly ranked alignment and the match was not a repeat sequence. All valid hits from the BLAST results were deposited in our MS-SQL Server databases. Based on the information from the BLAST results, VSTs were mapped to the mouse genome (http://www.ensembl.org/Multi/martview), and flanking genes for each of the VSTs were identified. All Ensembl annotated genes within 200 kb of each VST location are displayed in the VISION database (http://www.mouse-genome.bcm.tmc.edu/vision).
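
The flanking-gene lookup described above can be sketched as follows. This is a minimal illustration, not the VISION implementation: the `Gene` record, the function names, and the example coordinates are hypothetical, and only the 200-kb window comes from the text.

```python
from dataclasses import dataclass

WINDOW_BP = 200_000  # 200-kb window stated in the text


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are on one chromosome, in bp."""
    name: str
    chrom: str
    start: int
    end: int


def distance_to_gene(pos, gene):
    """Distance in bp from an insertion to the nearest gene edge (0 if inside)."""
    if gene.start <= pos <= gene.end:
        return 0
    return min(abs(pos - gene.start), abs(pos - gene.end))


def flanking_genes(chrom, pos, genes, window=WINDOW_BP):
    """Return (distance, name) pairs for genes within `window` bp, nearest first."""
    hits = [(distance_to_gene(pos, g), g.name) for g in genes if g.chrom == chrom]
    return sorted((d, n) for d, n in hits if d <= window)


# Made-up annotation used only to exercise the function.
genes = [
    Gene("geneA", "2", 1_000_000, 1_050_000),
    Gene("geneB", "2", 1_300_000, 1_320_000),
    Gene("geneC", "3", 1_000_000, 1_010_000),
]
print(flanking_genes("2", 1_100_000, genes))
```

Reporting every gene in the window, rather than only the nearest one, mirrors the stated design of listing all annotated genes within 200 kb of each VST.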

Calculation of false positives

The proportion of cancers caused by genetic background versus virus insertion can be differentiated using basic probability:

$$ P(a\ \mathrm{or}\ b) = (P_{a} + P_{b}) - (P_{a} * P_{b}) $$

Cancer rate in the absence of transposable virus is 12.5% (Mucenski et al. 1986); overall AKXD cancer rate (with transposable virus) is 43% (this article):

$$ 43\% = (12.5\% + V) - (12.5\% * V) $$

where V is the cancer rate due to viral insertion and is 39%. The proportion of cancer that can be attributed to viral insertion in this study is

$$ V / \mathrm{total\ cancer\ rate} = 0.39/0.43 = 91\% $$

Since viral insertion is causative in 91% of cancers and the viral insertion number in cancers is low, one would expect a low rate of false positives. The average number of proviral insertions required to cause cancer in this model is unknown but it must be at least 1 (since viral insertion is the primary mutagen) and less than 5.5 (average viral insertions per tumor). This allows us to calculate a range of true positives that we should expect to recover:

204 tumors * 91% requiring viral insertion * 1–5.5 insertions required to induce cancer = 186–1021 insertions contributing to cancer

186–1021 insertions contributing to cancer * 49% detection rate of VISA = 91–500 true-positive VSTs recovered
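
The range above can be reproduced with a few lines of arithmetic. This is a sketch that takes the 91% viral fraction, the 1–5.5 insertion range, and the 49% detection rate as given in the text; the constant and function names are illustrative.

```python
TUMORS = 204            # tumors screened
VIRAL_FRACTION = 0.91   # fraction of cancers attributed to viral insertion (from the text)
DETECTION_RATE = 0.49   # fraction of insertions recovered by VISA (from the text)
MIN_REQUIRED = 1.0      # lower bound on insertions required per tumor
MAX_REQUIRED = 5.5      # upper bound (average insertions per tumor)


def expected_true_positives(required_per_tumor):
    """Return (insertions contributing to cancer, true-positive VSTs recovered)."""
    contributing = TUMORS * VIRAL_FRACTION * required_per_tumor
    recovered = contributing * DETECTION_RATE
    return contributing, recovered


low = expected_true_positives(MIN_REQUIRED)
high = expected_true_positives(MAX_REQUIRED)
print(f"contributing insertions: {low[0]:.0f}-{high[0]:.0f}")
print(f"true-positive VSTs recovered: {low[1]:.0f}-{high[1]:.0f}")
```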

Calculation of degree of saturation

We have used a truncated Poisson distribution to determine the number of undiscovered alleles from the frequency of alleles with more than one mutation (f), using the equation:

$$ f = \frac{1 - e^{-m} - m e^{-m}}{1 - e^{-m}} $$

(Johnsen 2000) where m is mean number of insertions per locus, f is multiple-hit common insertion loci/total loci (from above):

f = 46/186 = 0.2473

f = 46/1021 = 0.0451

% undiscovered loci = \( e^{-m} \) = 58–91%

saturation = \( 1 - e^{-m} \) = 9–42%
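
The truncated Poisson equation has no closed-form solution for m, so it has to be solved numerically. The sketch below uses simple bisection, which is an assumption; the text does not state how m was obtained. The function names are illustrative, and the expression for f is rewritten in the algebraically equivalent form 1 − m/(e^m − 1).

```python
import math


def multi_hit_fraction(m):
    """f(m) = (1 - e^-m - m*e^-m) / (1 - e^-m), rewritten as 1 - m / (e^m - 1)."""
    return 1.0 - m / math.expm1(m)


def solve_mean(f, lo=1e-9, hi=50.0, tol=1e-12):
    """Find m such that multi_hit_fraction(m) = f by bisection (f increases with m)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if multi_hit_fraction(mid) < f:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


for total_loci in (186, 1021):
    f = 46 / total_loci
    m = solve_mean(f)
    undiscovered = math.exp(-m)
    print(f"f = {f:.4f}  m = {m:.3f}  "
          f"undiscovered = {undiscovered:.0%}  saturation = {1 - undiscovered:.0%}")
```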

Real-time quantitative RT-PCR (QPCR)

Total RNA was isolated from frozen tumor tissue samples with RNA-stat (Tel-test Inc., Friendswood, TX). cDNA was prepared from 2 μg of RNA with the Superscript II Reverse Transcription Kit (Invitrogen, Carlsbad, CA) using random hexamers. QPCR primers were designed using Primer Express (Applied Biosystems) with amplicons spanning exon junctions in most cases. Optimal concentrations were determined for each primer pair by running all pairwise combinations of 50-, 300-, and 900-nM primers on wild-type mouse blood, spleen, or embryo cDNA. Real-time PCR was performed in the ABI 7000 TaqMan. Applied Biosystems SYBR Green master mix was used with 1 μl of cDNA/well in 96-well optical plates. PCR cycling conditions were 1 × (95°C for 10 min) and 40 × (95°C for 15 sec; 60°C for 1 min), with a dissociation curve performed following amplification.

Results

Screening for cancer loci

A total of 469 male and female mice from nine different AKXD RI strains that develop primarily B-cell or mixed B/T-cell leukemia or lymphoma were aged for 18 months. Although tumor incidence varied by strain, 43% (204/469) of aged mice developed lymphomas (Table 1). Tumors were collected, and histopathology was performed to determine cell type. Somatic recombination of immunoglobulin (IgH, IgK) and T-cell receptor β-chain genes (JB1 and JB2) was analyzed by Southern blot to further refine lymphoid cell typing. The combined histologic and molecular criteria used to determine cell type demonstrate the high frequency of B-cell lineage neoplasms (48%; Table 2) (Morse et al. 2001). The molecular cell type, stage, and nature of each tumor can be found at the VISION database.

Table 1 AKXD mouse, tumor, and VST information by strain
Table 2 Distribution of tumors collected from AKXD mice

Tumors were screened by the VISA technique to identify the location of somatic retroviral insertion sites (Hansen et al. 2000). To avoid redundancy, multiple sequence tags identified by different restriction enzyme primers from a single animal that overlap were edited into a single VISA sequence tag (VST) representing the longest unique flanking sequence (Fig. 1A). The fact that our technique amplified from the 5′ end of the provirus alone ensured that nonoverlapping tags from a single tumor were not acquired. From over 3000 sequence tags, a total of 550 nonendogenous VSTs were identified in tumors that occurred in 204 animals.

Fig. 1

(A) VISA screening yields the exact base pair of insertions. The VSTs that define Evi1 are shown aligned by CLUSTAL W to illustrate VST clustering analysis. VST sequences are generated from amplification from a restriction enzyme site (underlined) anchored genomic primer and a virus-specific primer. The start of proviral sequence (shown in bold) marks the exact base of viral insertion. (B) Clustered VSTs are displayed graphically in the VISION database with all annotated candidate genes within 200 kb of any insertion. A screen capture of the Notch1 locus is shown here for illustration.

VST sequences were subjected to BLAST against the Ensembl mouse genome sequence (NCBI m36). After exclusion of low-scoring matches and repetitive sequence matches, 468 VSTs yielded single significant matches to the genome. These insertions were analyzed for genomic clustering to identify CISs (Fig. 1B). CISs representing endogenous virus loci were identified and excluded based on locations of known endogenous virus, all hits occurring at the same base, and/or VST matches to viral sequence. Of the 468 BLAST matches, 66 map to the locations of 16 endogenous proviruses or virus-like LTR sequences; thus, 402 VSTs identify potential tumor-specific tags.

Identification of CISs from these 402 VSTs was based primarily on proximity of the insertions (average ∼25 kb) but in some cases it also included Southern blotting to confirm that rearrangements were tumor specific, a characterization of aberrant expression of genes at the locus, and/or concurrence with previously identified loci. When VSTs surrounded a single gene but were spread over longer distances, we designated the site as a single CIS (e.g., Evi16; Table 3). We identified 49 CISs that were ranked with high confidence because they have multiple hits in our AKXD tumors (Table 3). When a CIS unique to this screen was defined by multiple insertions, we designated the CIS a lymphoid viral insertion site (Lvis). Loci previously identified in other screens are named in accordance with the published nomenclature. We identified 25 VSTs in a second class of loci consisting of single insertions that identify CISs previously identified in screens of other tumors (Table 4). The 74 CISs are distributed across the entire mouse genome and lie in or near known proto-oncogenes and TSGs, in the vicinity of candidate TSGs and proto-oncogenes, and near a set of predicted genes.
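
Proximity-based grouping of insertions can be sketched as below. This is an illustration under stated assumptions, not the procedure used in the study: the text gives only an average spacing of about 25 kb and also relied on Southern blotting, expression data, and prior loci, so the fixed `max_gap` cutoff, the function names, and the example data are all hypothetical.

```python
from collections import defaultdict


def cluster_insertions(insertions, max_gap=25_000):
    """Group (chrom, pos, tumor_id) insertions whose neighbours lie within max_gap bp."""
    by_chrom = defaultdict(list)
    for chrom, pos, tumor in insertions:
        by_chrom[chrom].append((pos, tumor))

    clusters = []
    for chrom in sorted(by_chrom):
        current = []
        for pos, tumor in sorted(by_chrom[chrom]):
            # Start a new cluster when the gap to the previous insertion is too large.
            if current and pos - current[-1][0] > max_gap:
                clusters.append((chrom, current))
                current = []
            current.append((pos, tumor))
        if current:
            clusters.append((chrom, current))
    return clusters


def multi_hit_clusters(clusters):
    """Keep clusters with insertions from at least two independent tumors."""
    return [(chrom, hits) for chrom, hits in clusters
            if len({tumor for _, tumor in hits}) >= 2]


# Made-up insertions: two tumors hit one site on chr2; one tumor is hit twice on chr7.
example = [
    ("2", 100_000, "T1"), ("2", 110_000, "T2"), ("2", 500_000, "T3"),
    ("7", 40_000, "T1"), ("7", 45_000, "T1"),
]
print(multi_hit_clusters(cluster_insertions(example)))
```

Counting independent tumors rather than raw tags keeps two tags from the same tumor from being scored as a common site.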

Table 3 Common insertion sites (CISs)
Table 4 CISs with one tag in our database

The Viral Insertion Sites Identifying Oncogenes (VISION) Database

We have constructed a public internet-based graphical database that provides specific information on VSTs, including genomic locations, VST sequence, tumor typing and histopathology, insertion-proximal genes, and pathway analysis. Users can use the graphic view to browse CISs by chromosome, viewing distance and orientation of genes with links to additional gene information from Ensembl, and Biocarta pathway information. Users can access primary VST sequence data, histopathology results, and molecular tumor-typing information by clicking on the red vertical bars representing individual VSTs in the graphic view (Fig. 1B), or by clicking on the VST name in the list view. The list view allows sorting of VSTs by chromosome, tumor type, accession number, or VST name, as well as links to individual VST records. The database can be accessed at http://www.mouse-genome.bcm.tmc.edu/vision. VST sequences are also available in GenBank and the RTCGD (http://www.rtcgd.abcc.ncifcrf.gov/) to make these data as widely available as possible.

Estimation of false-positive rate and degree of saturation

MuLV insertions can occur in any chromosome in a sequence-independent manner, but MuLV exhibits a slight preference for insertions near transcriptional start sites and CpG islands (Lewinski et al. 2006; reviewed in Wu et al. 2003). In vitro studies indicate that chromatin conformation also influences target site selection (Muller and Varmus 1994; Pryciak and Varmus 1992; Pryciak et al. 1992). The identification of false positives is possible through the sequencing of clonal mutations that have no association with cancer initiation or progression. Compared to positively selected oncogenic mutations, these uninformative insertions are less likely to occur in the same locus in independent tumors. False-positive CIS identification rates have been estimated by several different methods. Random insertion based on genome size estimates the rate of false-positive detection of cancer loci defined by two proviral insertions in a 26-kb window as one common site in 250 VSTs distributed randomly in the genome (Mikkers et al. 2002). Monte Carlo simulations estimate 17% of CISs are false positive (16.3 simulations/95 detected) using 1200 random insertions (Suzuki et al. 2002; supplemental). Experiments using MuLV in immortalized human cervical cell lines estimate as high as two thirds of two-hit CISs are false positive (Wu et al. 2006). With such a wide variation in estimates, we found it necessary to address this point.

Using experimental data, we first calculated the detection efficiency of VISA by comparing VSTs recovered by VISA to screening by Southern blot hybridization for clonal proviral insertions, a more thorough but lower-throughput method of insertion detection. We recovered an average of 2.7 VSTs per tumor by the VISA method (550 nonendogenous VSTs/204 tumors). Previous analyses showed that 1–10 somatically acquired proviruses can be detected by Southern blotting in these tumors (data not shown). Screening by Southern blot at eight loci yielded several insertions not detected by VISA, showing that our VISA tagging identified approximately 49% (24/49) of the insertion events at these sites. This allows us to estimate an average of 5.5 somatic proviral insertions per tumor (2.7 VSTs/49% detection rate). Using the 550 nonendogenous VSTs (including hits in repeat sequence and hits with no BLAST match), we calculate 1122 potential VSTs (550/49%). By isolating cancer rates due to viral insertion, we calculate that there are between 186 and 1021 true-positive VSTs in the pool of 1122 total (based on 1–5.5 proviral insertions required for tumor formation; see Materials and methods section), so 17–91% of the 550 collected VSTs identify true-positive cancer-associated VSTs. The lower end of this range represents an average of 1 viral insertion/tumor required to cause cancer and the upper end represents an average of 5.5 insertions/tumor required to cause cancer. These 550 nonendogenous VSTs were narrowed to 165 VSTs at 74 loci by clustering analysis, representing a dramatic minimization of false positives.
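
The detection-efficiency arithmetic in this paragraph can be checked step by step. The sketch below uses the rounded 49% detection rate, as the text does; variable names are illustrative, and the true-positive bounds of 186 and 1021 are taken from the Materials and methods calculation.

```python
VSTS = 550               # nonendogenous VSTs recovered
TUMORS = 204             # tumors screened
DETECTED, EVENTS = 24, 49  # insertions found by VISA / found by Southern blot at eight loci

vsts_per_tumor = VSTS / TUMORS                 # about 2.7
detection_rate = round(DETECTED / EVENTS, 2)   # 0.49, rounded as in the text
insertions_per_tumor = vsts_per_tumor / detection_rate  # about 5.5
potential_vsts = VSTS / detection_rate         # about 1122

# Bounds on true-positive insertions (from the Materials and methods calculation).
true_low, true_high = 186, 1021
fraction_low = true_low / potential_vsts       # about 17%
fraction_high = true_high / potential_vsts     # about 91%

print(f"VSTs per tumor: {vsts_per_tumor:.1f}")
print(f"insertions per tumor: {insertions_per_tumor:.1f}")
print(f"potential VSTs: {potential_vsts:.0f}")
print(f"true-positive fraction: {fraction_low:.0%}-{fraction_high:.0%}")
```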

Calculations of mutagenic saturation provide a rough estimate of the completeness of a data set. Modeling insertion frequency with a Poisson distribution estimates that 9–42% of the readily identifiable loci have been identified by at least a single insertion (see Materials and methods section). This is a low degree of saturation and fits with the low degree of overlap with other screens, low multiplicity of insertion at CISs, and a high degree of unique data from this study. This calculation of saturation does not account for variability in insertion efficiency at different loci and may not accurately predict the number of oncogenes in hematopoietic cells. However, the calculation does provide an estimate of the number of sites likely to be identified by retroviral insertion in the AKXD strains used in this screen and what proportion have been identified by our VST screen.

Expression of genes flanking proviral insertion sites

In many instances there are a number of genes in the vicinity of a specific CIS with one, and possibly more, being affected by the proviral integration event. Retroviral insertions are known to alter gene expression at distances of over 50 kb (Hansen and Justice 1999b; Jonkers and Berns 1996), so the analysis of gene expression in tumors allows for the validation of candidate genes and a determination of whether insertions upregulate or downregulate expression of the flanking genes. We selected 16 candidate genes with one or more proximal insertions at nine different loci to test expression by real-time quantitative PCR (QPCR). The loci were selected based on the presence of interesting candidate genes. Several genes were selected based on their association with the Notch signaling pathway (Notch1, Notch2, Pofut1, Psen1, Numb, Lfng, Dtx2, ADAM precursor, and ADAM 3, 9, 19, 32) or because they map in close proximity to Notch genes (Egfl7). The remaining genes were selected because they were sites of very high numbers of insertions (Sox4, rRas2, and Copb1). Nineteen tumors with insertions and two control nontumor AKXD tissues were assayed for expression of all 16 genes. Three of four CISs showed significant upregulation of one or more proximal genes and two of five single insertions showed upregulation. Significant upregulation of gene expression generally correlated with the presence of insertions proximal to the upregulated gene, particularly at CISs (Fig. 2). In some cases we found genes to be upregulated without proximal insertions. This could be due to failure to detect a proximal insertion or mutation of another gene in the pathway that led to upregulation.

Fig. 2

Misregulation of genes proximal to insertion loci. A set of 16 genes at 9 loci (4 CISs, 5 single hits) with proximal insertions was tested by real-time QPCR for misregulation in 19 tumors bearing those insertions and 2 control tissues. Tumors are labeled according to the organ of origin (mLN = mesenteric lymph node; LN = axillary, cervical, and/or branchial lymph nodes). Only genes that displayed significant misregulation are displayed here. Three CISs assayed (Notch1, Notch2, Ras/CopB1) showed misregulation of at least one proximal gene; two of the five single hits showed significant change of one gene. Real-time PCR threshold cycle (Ct) readings were normalized for cDNA content by comparison to the average Ct of the 16 genes assayed for each tumor using the formula: \( \Delta\mathrm{Ct} = \mathrm{Ct}_{\mathrm{sample}} - \mathrm{Ct}_{\mathrm{avg.\ for\ tumor}} \). To normalize scale across several genes, fold change relative to average ΔCt for each gene across all tumors (except tumors with insertions near the gene) was calculated using the equation: \( \mathrm{fold\ change} = 2^{(\Delta\mathrm{Ct}_{\mathrm{avg.\ for\ gene}}) - (\Delta\mathrm{Ct}_{\mathrm{sample}})} \). To determine significant changes in gene expression, the average and standard deviation of normalized values were calculated for each gene across all tumors. Values on the table are expressed as the number of standard deviations from the mean of fold changes for each gene. The shaded box designates −1 to +1 standard deviations. Values outside this box are considered significant. Black arrows denote significant misregulation of genes in tumors harboring insertions near the misregulated genes. rRas2 and Copb1 are at the same locus and most insertions fall between the two genes.
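
The normalization described in this legend can be sketched as follows. The function names and example Ct values are hypothetical, and the use of the sample standard deviation is an assumption, since the legend does not say which form was used.

```python
import statistics


def delta_ct(ct_by_gene):
    """Delta-Ct per gene in one tumor: Ct minus that tumor's average Ct over all genes."""
    avg = statistics.mean(ct_by_gene.values())
    return {gene: ct - avg for gene, ct in ct_by_gene.items()}


def fold_change(dct_sample, dct_avg_for_gene):
    """fold change = 2 ** (Delta-Ct averaged for the gene - Delta-Ct of the sample)."""
    return 2.0 ** (dct_avg_for_gene - dct_sample)


def z_scores(values):
    """Standard deviations from the mean (sample standard deviation assumed)."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]


# Made-up Ct readings for two genes in one tumor.
print(delta_ct({"geneA": 20.0, "geneB": 24.0}))
# A sample whose Delta-Ct is 2 cycles below the gene average shows 4-fold higher expression.
print(fold_change(dct_sample=-1.0, dct_avg_for_gene=1.0))
```

A lower Ct means more template, so a sample ΔCt below the gene average gives a fold change above 1.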

Discussion

Using VISA, we identified the genomic sequences flanking somatically acquired proviral insertions in tumors from nine AKXD RI strains that develop a high incidence of B-cell or mixed B/T-cell leukemia and lymphoma. The VISA technique has an advantage over PCR techniques that require a subcloning step because PCR products are sequenced directly, reducing the likelihood of sequencing contaminating products. Furthermore, because a rare PCR product is not likely to be subcloned and sequenced, such a technique focuses on proviral insertions that are monoclonal in the tumor population and, therefore, are likely to cause tumor formation. Analysis of the cellular sequences allowed us to identify 74 loci that identify one or more candidate proto-oncogenes or TSGs.

Assessing data quality was a priority, with reports of potentially high rates of false-positive CISs (de Ridder et al. 2006; Wu et al. 2003). In AKXD mice, viral integration is the primary mutagen contributing to cancer formation. Base cancer rates from AKXD-20 mice that lack a transposable AKV virus were compared to rates from AKXD mice containing active virus in this study, allowing us to calculate that 91% of cancers in these AKXD mice are causally associated with at least one viral insertion. Of our 550 nonendogenous VSTs, we calculate 17–91% identify true-positive CISs. This range is dependent on the average number of true-positive insertions required to cause cancer, with 17% (true-positive percentage) of 550 VSTs representing a situation where only one viral insertion per cell on average was required to cause cancer. To reduce the false-positive rate, we used clustering of insertions to identify cancer-associated loci. If insertions are identified at the same locus in independent tumors, they are more likely to be a result of positive selection and representative of oncogenic mutations. Clustering analysis refines the pool of 550 VSTs to 165 VSTs, reducing the false-positive rate. These 165 VSTs represent 74 frequent insertion loci selected for oncogenicity. To reconcile these observations with the high observed false-positive rate in immortalized human cervical cell lines (Wu et al. 2006), we have only to point out that as cancer-associated insertions are positively selected, insertions that are not associated with cancer are negatively selected, reducing the background false-positive rate. Because every sample analyzed here was cancerous, we have enriched for insertions causing cancer, reducing the number of “random” insertions and the false-positive error rate. Furthermore, the degree of saturation of our VST detection indicates that we have likely not identified each of the CISs causative of cancer in this lymphoid tumor bank.

A comparison to similar screens in other mice (Mikkers et al. 2002; Shin et al. 2004; Suzuki et al. 2002) indicates that the CISs identified in our study are likely to be influenced by differences in genetic background. We compare our data to the screen described by Suzuki et al. (2002) because it was of similar scale, included AKXD RI strains and lymphoid neoplasms, and neither screen was based on a known sensitizing mutation. The Suzuki screen identified 493 tags at 152 CISs, whereas we identified 140 tags at 46 CISs. As a measure of saturation, allele frequencies for CISs (number of insertions/number of loci) in our VST screen were comparable to the Suzuki screen: 3.04 (140/46) (this article) vs. 3.24 (493/152) (Suzuki et al. 2002). One would expect to find a much greater allele frequency/saturation with more CIS tags when screening the same number of possible cancer loci. The discrepancies between the two screens for allele frequency and number of sites identified can be partially explained by the number of different strains used for each screen and strain genetic background. Target site preferences of provirus have been shown to vary among mouse strains and among AKV subtypes (Martin-Hernandez et al. 2006). Our screen used a subset of AKXD strains that are susceptible to B-cell neoplasms and the Suzuki screen used more AKXD strains, fewer aged mice per strain, and NFS.V+ and BXH2 RI mice. It is likely that strain variations, particularly predisposing mutations in individual AKXD strains, are influencing the pool of alleles recovered in common. Therefore, the calculated saturation is measuring primarily the mutation of loci that cooperate with existing cancer-predisposing mutations in these strains, rather than all possible cancer loci.

Additional comparisons between the two screens showed that only 15 of the 46 multihit CISs were identified as CISs in the previous screen. Low overlap between the screens suggests that many oncogenes and TSGs identified at these CISs will be unique to this AKXD tumor subset. Except for hits at the Lvis1 and Sox4 loci, the allele frequency was nearly the same for the remaining 13 shared loci with 3.15 (41/13) insertions/locus in our screen vs. 3.46 (45/13) in the Suzuki screen, indicating similar levels of saturation. This also demonstrates that Suzuki et al. (2002) were sampling from a larger pool of mutations because they found very similar mutation frequencies for most targeted genes that were common between the screens, yet they mapped more than twice as many sequence tags. The implication of this finding is that these screens are identifying cancer loci that cooperate with predisposing cancer loci carried in these genetic backgrounds at a high rate while identifying all blood cancer loci at a lower rate. This also raises the possibility that strain-specific germ line insertions may be acting as sensitizing mutations, influencing the frequency with which specific loci are identified as well as the occurrence of specific tumor types.

Another measure of the quality of the data is the ability to replicate previous work and to identify validated oncogenes. As expected, many of the insertional mutations identified CISs that were known from previous screens, including Evi1, 3, 11, 16, 32, 38, 53, 54, 84; Lvis1-9; Myc, N-Myc, Ccnd1, Ccnd3, Ris2, Nras, Notch, and Gfi1 (Hansen and Justice 1999a; Hansen et al. 2000; Li et al. 1999; Mikkers et al. 2002; Mucenski et al. 1988a; Shin et al. 2004; Suzuki et al. 2002). This validates the VISA method for identification of CISs; however, the most interesting data are the 21 loci not previously identified as CISs. These sites may be specific to a lymphoid subset of tumors, and, in particular, many may be unique to the B-cell lineage. Several of these novel sites have been validated as novel cancer-causing genes by gene expression in tumors or by oncogenicity after bone marrow transductions, or in transgenic mice, including Hex (George et al. 2003), Evi3 (Hentges et al. 2005), Tnfrsf13c (Hentges et al. 2002), Evi1 (Boyd et al. 2006), Sox4 (Boyd et al. 2006), and Eg5 (Castillo et al., unpublished). Of note, the Lvis24 (Tnfrsf13c) site was first detected as only a single hit in our VST database, but was pursued because it was commonly overexpressed in human lymphoid tumors in the Cancer Genome Anatomy Project (CGAP) database, showing that some unclustered single insertions are true positives.

The Lvis1 site is the most frequently rearranged locus in this tumor subset, with retroviral insertions detected in 26 independent tumors. Subsequent tests of gene expression showed that although the insertions occurred 50 kb distant from the hematopoietic transcription factor Hex, they upregulated the expression of Hex as well as the kinesin-related spindle protein Eg5. Subsequently, both genes have shown oncogenicity in bone marrow transductions (George et al. 2003) or transgenic mice (Castillo et al., unpublished), indicating that retroviral insertions may influence the expression of two or more oncogenes at a single site and affect genes more than 50 kb distant. Such an observation led us to develop the VISION database such that all genes in the region of the proviral insertions are displayed, rather than indicating only the closest gene to the retrovirus. Similarly, all genes in the region of the proviral insertion site are listed in Table 1 because any or all of them are candidate proto-oncogenes.
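The difference between reporting only the closest gene and reporting every gene in the region can be illustrated with a minimal sketch. This is not the VISION implementation. The 100 kb window, the coordinates, the placeholder gene "FarGene", and all function names are assumptions chosen for illustration, and all features are assumed to lie on one chromosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int  # bp, hypothetical coordinate
    end: int    # bp, hypothetical coordinate


def genes_near_insertion(insertion_pos, genes, window_bp=100_000):
    """Return (name, distance_bp) for every gene within window_bp of the
    insertion, nearest first. The window size is an assumed parameter."""
    hits = []
    for gene in genes:
        if gene.start <= insertion_pos <= gene.end:
            distance = 0  # insertion lies inside the gene
        else:
            distance = min(abs(insertion_pos - gene.start),
                           abs(insertion_pos - gene.end))
        if distance <= window_bp:
            hits.append((gene.name, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Made-up coordinates: only the 50 kb spacing to Hex echoes the text.
genes = [
    Gene("Hex", 1_050_000, 1_056_000),
    Gene("Eg5", 1_080_000, 1_140_000),
    Gene("FarGene", 2_000_000, 2_010_000),  # placeholder outside the window
]
insertion = 1_000_000

candidates = genes_near_insertion(insertion, genes)
closest_only = candidates[:1]

print("all genes in region:", candidates)
print("closest gene only:", closest_only)
```

With these made-up coordinates, a closest-gene report would list only Hex, whereas the window-based report also retains Eg5 as a candidate.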

Histologic analysis reveals that this tumor collection is composed of primarily lymphoid tumors; therefore, we would expect to identify known or suspected proto-oncogenes and TSGs implicated in B-cell lymphomagenesis and/or normal B-cell differentiation. Indeed, several genes of this type are identified, including Bcl11b (Cismasiu et al. 2006), Plu-1 (Lu et al. 1999), Atp1b1 (Chen et al. 2006; Le Page et al. 2006), Blzf1/Jem1 (Duprez et al. 1997), Ccr4 (Yoshie et al. 2002), Ccr7 (Albert 2006), Ceacam I and II (Obrink 1997), and Traf1 and Malt1 (Nakagawa et al. 2005). Many of these have been implicated in lymphoma development by virtue of aberrant expression in tumors or as targets of translocation in human cancers. Finding these genes not only supports the validity of our methods but also provides further evidence that these genes are a cause rather than a consequence of cancer. In some cases, several genes that are known to function within an established or proposed signaling pathway were identified in our screen. Insertions identifying genes that act within oncogenic pathways support a functional interaction and delineate pathways involved in cancer.

The first such pathway is the Notch signaling pathway. Several members of the Notch pathway are identified as candidates, including Notch1, Notch2, Dtx2, Lfng, Psen1, Numb, Pofut1, Tcf3/E2A, and ADAM3, 5, 9, 18, and 32. We find insertions not only near Notch1 (a regulator of E2A and EBF) (Smith et al. 2005) and E2A, but also in Evi3, which modulates EBF activity (Hentges et al. 2005). Insertions in E2A/Tcf3 lie within the gene, which may indicate a loss of function followed by loss of heterozygosity. This is consistent with the ability of Notch1 to repress E2A activity and the observed upregulation of Notch1 in tumors. Cbfa2t3h (ETO-2) is 75% identical to ETO, which is a frequent target of translocations (with AML1) in humans. The resulting AML-ETO fusion protein, like Notch, inhibits E protein activity (Plevin et al. 2006). Independent of AML-ETO fusions, ETO itself is able to inhibit E-protein-dependent transactivation activity (Zhang et al. 2004). Likewise, ETO-2 binds SCL (a T-ALL oncoprotein) in a complex that inhibits E-protein activity. SCL also binds GFI1B (Schuh et al. 2005), the functional equivalent to Gfi1 in blood cells (Fiolka et al. 2006). Gfi1 is a frequent target of insertions in this screen and is the homolog of the Drosophila gene senseless, a modulator of Notch signaling (Nolo et al. 2000). Not only do we observe frequent insertion near Notch signaling genes, but these genes are misexpressed in the cancers. QPCR was used to assess expression levels of several candidate genes, including several Notch signaling genes, where Notch1, Notch2, Numb, and Pofut1 all showed significant upregulation (Fig. 2). Associations between these genes are illustrated in Fig. 3. Mutation of many genes that affect Notch signaling and E protein signaling identifies these pathways as common and powerful routes to cancer in blood cells.

Fig. 3 The Notch pathway is a frequent target of insertion. Members of the Notch signaling pathway were identified repeatedly as candidate oncogenes. Shaded boxes indicate genes with proximal insertions. Arrows represent direct or indirect effects that occur in one or more hematopoietic cell types. These associations are not necessarily occurring contemporaneously or in the same cell type.

NFκB signaling is frequently involved in proliferation and survival signals of many blood cell lineages. NFκB signaling genes Ikka/Chuk, Malt1, and Traf1 were identified by multiple insertions in this screen. MALT1, IKKB, and IKKA/CHUK form a complex that is able to activate NFκB signaling in T and B cells (O’Mahony et al. 2000; Wegener et al. 2006; for review see Thome 2004). TRAF1 is able to activate NFκB during B-cell activation via CD40 but may have many roles in lymphoid development (Xie et al. 2006; for review see Lee and Lee 2002). These data indicate that aberrant activation of NFκB contributes to lymphoid cancer and that other members of this pathway are potential oncogenes.

Evi1 was originally identified through viral mutagenesis as a myeloid oncogene (Mucenski et al. 1988b). A screen for targets of EVI1 transcription factor activity yielded several candidate genes, including Map3k14 (Yatsula et al. 2005). Our screen identified three insertions near Evi1 and two insertions near Map3k14, suggesting a functional pathway for Evi1 and Map3k14 in cancer.

A pair of related genes, Hck and Fgr, was identified as candidate oncogenes by this screen. Notably, both of these Src family kinases are used for β-integrin outside-in signaling (Giagulli et al. 2006). In acute myeloid leukemia blasts, the expression of FGR and HCK is associated with early commitment and differentiation of myeloid lineages (Willman et al. 1991). These molecules are also critical for pre-B ALL induction by BCR/ABL (Hu et al. 2004). Furthermore, comparative genomic hybridization showed amplification of FGR in endometrial cancer (O’Toole et al. 2006). The somatic mutations found in the Fgr and Hck loci provide additional evidence linking downstream signaling of these two genes to cancer. The preceding pathways are all normally involved in proliferation or differentiation of lymphoid lineages, lending further validity to their identification as oncogenes or TSGs in a primarily lymphoid cancer collection.

The screen of the lymphoid tumors described here provides a new collection of 21 loci that are associated with leukemia and lymphoma. The majority of tumors recovered were of B-cell origin, indicating that the loci are identifying genes involved in normal B-cell development and, when expressed aberrantly, in cancer. Indeed, many of the genes identified are B-cell-specific genes with known functions in development and maturation. Insertions near several members of multiple pathways identify networks of signaling active in the development and progression of blood cancers. Notably, members of the Notch signaling pathway were frequently identified and, for some, deregulated expression was verified by QPCR. The identification of signaling networks active in cancer will allow us to understand cancer as a misregulation of cell processes rather than a collection of mutations leading to a phenotype. This is particularly useful for the evaluation and prioritization of drug targets and the development of effective treatments for disease.