Background

Bananas and plantains (Musa spp.) are perennial giant herbs grown in humid tropical and subtropical regions. Their annual production exceeds 100 million tons, of which almost 90% is targeted for local and national markets [1]. Cultivated bananas are parthenocarpic, seed-sterile, vegetatively propagated diploid, triploid and tetraploid clones. Most of them are hybrids between two diploid (2n = 2x = 22) species, M. acuminata and M. balbisiana [2], with the A and B genomes, respectively. The production of bananas is threatened by many diseases and pests, but the clonal nature, seed sterility and the lack of knowledge on the origin of cultivated clones hamper breeding of improved cultivars. It is expected that the use of molecular tools will speed up banana germplasm improvement. Sadly, although the socio-economic importance of bananas and plantains cannot be questioned, Musa remains outside the focus of major research programs and must be considered an under-researched crop.

This situation is reflected by a limited knowledge of the banana nuclear genome, even though it is relatively small (1C ~ 600 Mbp) [3, 4]. It has been estimated that about 55% of the genome is made of various DNA repeats [5], but only a limited number of repetitive DNA sequences has been characterized. Valárik et al. (2002) described twelve Radka repeats [6], representing partial sequences of various mobile elements and rRNA genes. Other characterized sequences included a Copia-like element [7, 8], a species-specific element Brep-1 [9, 10] and a Ty3/gypsy-like retrotransposon monkey [11]. In order to identify more repeats, Hřibová et al. [5] applied a low-Cot DNA isolation technique to characterize highly repetitive fractions of banana genome. An important step forward in dissecting the Musa genome was made by Cheung and Town (2007), who sequenced ends of more than 6,000 BAC (Bacterial Artificial Chromosome) clones [12]. Moreover, 62 BAC clones were completely sequenced through a Generation Challenge Programme funded project (GCP 2005-15), within the context of the Global Musa Genomics Consortium [13]. Nevertheless, even after these efforts, the knowledge on the repetitive part of Musa genome remains far from complete.

Recent introduction of the next generation sequencing methods [14] provided powerful tools to discover and characterize DNA repeats, even in complex plant genomes. For example, Macas et al. (2007) used the 454 technology to characterize repetitive DNA in the nuclear genome of pea (Pisum sativum L.) [15]. Despite the relatively small proportion of sequenced DNA relative to the whole genome (33.3 Mb or ~ 0.77% of the genome), the authors identified and characterized most types of retrotransposons and discovered thirteen new families of tandemly organized repeats. In a similar study, Swaminathan et al. (2007) used the 454 system to sequence 7.5% of the soybean genome [16].

This study addresses the lack of knowledge on the repetitive part of the banana genome by characterizing all major DNA repeats after massively parallel sequencing of genomic DNA of a diploid clone of M. acuminata. The experimental approach follows that of Macas et al. [15], in which all-to-all similarity comparison of 454 reads is performed to identify groups (clusters) of overlapping reads representing repetitive genomic sequences. As the number of reads in individual clusters is proportional to genomic abundance of corresponding repeats, this information can be used for quantitative analysis of repetitive genome landscape. In addition, consensus sequences of the repeated elements can be obtained by assembling the reads within clusters. We also demonstrate how databases of 454 reads sorted according to the type of repeats can be used to identify and classify repeats in BAC sequences. Finally, the large set of sequences obtained in this study provides a unique source of molecular markers potentially useful in genome mapping, anchoring physical maps, analyzing genetic diversity and for phylogenetic studies.
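The quantitative principle stated above (cluster size is proportional to genomic abundance) can be illustrated with a minimal Python sketch. It assumes that reads sample the genome uniformly, so that the share of reads falling into a cluster approximates the share of the genome occupied by the corresponding repeat. The function name, cluster names and read counts are hypothetical and serve only as an illustration; they are not data from this study.

```python
def genome_proportions(cluster_sizes, total_reads):
    """Estimate the genome proportion (in percent) of each repeat cluster.

    Assumes uniform random sampling of the genome by the reads, so that
    reads_in_cluster / total_reads approximates the genome fraction.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {name: 100.0 * n / total_reads for name, n in cluster_sizes.items()}


# Hypothetical cluster sizes (number of reads per cluster), for illustration only.
clusters = {"CL_a": 5000, "CL_b": 250, "CL_c": 20}
props = genome_proportions(clusters, total_reads=100_000)

# Report clusters from the most to the least abundant.
for name, pct in sorted(props.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct:.3f}% of the genome")
```

Because the estimate is a simple ratio, its precision for rare repeats depends on the total number of reads, which is why only clusters above a minimum size are usually examined.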

Results and Discussion

A diploid clone M. acuminata cv. 'Calcutta 4' was chosen for sequencing as it has been used extensively as a model genotype in previous molecular studies [5, 6, 12, 17-19]. Moreover, this clone is being used in various banana breeding programs as a source of disease resistance [20, 21]. A sequencing run of nuclear DNA on the GS FLX platform (454 Life Sciences/Roche) resulted in 477,699 reads with an average length of 206 bp, providing a total of 98,538,911 bp of sequence data. Considering the genome size of 'Calcutta 4' (1C = 623 Mbp) [4], this represents 15.7% of the genome. The sequencing reads were clustered based on their similarity and all clusters containing at least 20 reads (roughly representing 0.01% of the genome) were further investigated.

LTR-retrotransposons

The most abundant DNA sequences found in the banana genome were LTR-retrotransposons. Out of them, Ty1/copia represented more than 16% of the genome while the Ty3/gypsy elements represented about 7% of the genome (Figure 1). This is an interesting observation as the available data from other sequencing projects indicate a prevalence of Ty3/gypsy retrotransposons in plant nuclear genomes [15, 22-24]. In order to get insight into the diversity of banana LTR-retrotransposons, we performed phylogenetic analysis based on a comparison of their reverse transcriptase domains. This work revealed that while the more abundant Ty1/copia-like elements were represented by four distinct evolutionary lineages (Figure 2A), the vast majority of Ty3/gypsy-like elements belonged to a single evolutionary lineage of chromoviruses [25] (Figure 2B).

Figure 1

Genome proportion of major groups of repetitive sequences identified in banana 454 data.

Figure 2

Phylogenetic analysis of Musa retrotransposons based on RT sequences. Unrooted phylogenetic trees of Ty1/copia elements (A) and Ty3/gypsy elements (B). Names of the contigs assembled from 454 reads are printed in purple. Classification of the Ty3/gypsy lineages and chromoviral clades was done according to [30-32]. Major lineages of Ty1/copia elements were named according to a selected representative of each group.

About 74% of identified Ty1/copia sequences belonged to the SIRE/Maximus lineage [26, 27], representing almost 13% of the genome. The remaining Ty1/copia elements belonging to Angela, Tnt1 and Hopscotch lineages [27-29] represented only about 3.0, 1.0 and 0.2% of the genome, respectively. Interestingly, fluorescence in situ hybridization (FISH) on mitotic chromosomes revealed that elements from distinct evolutionary lineages have different patterns of genomic distribution. The elements from the SIRE/Maximus and Angela lineages were concentrated in several discrete clusters on all chromosomes (Figures 3D, E), whereas the elements from the Tnt1 lineage gave only weak signals preferentially localized in distal parts of mitotic chromosomes (Figure 3F). Elements belonging to the Hopscotch lineage were not tested for their distribution because of their very low proportion in the genome.

Figure 3

Genomic distribution of different types of DNA repeats. Mitotic metaphase spreads of M. acuminata cv. 'Calcutta 4' (2n = 22) after FISH with probes for various repeats. The chromosomes were counterstained with DAPI (blue). Bar = 5 μm. (A) Tandem repeat CL18 (green signal) formed a cluster on one pair of chromosomes (long arrows). (B) Tandem repeat CL33 (red signal) localized on two pairs of chromosomes (long arrows). (C) Simultaneous hybridization of probes for CL18 (green signal) and CL33 (red signal, long arrows) revealed co-localization of both satellites on one pair of chromosomes (short arrows). (D) Two metaphase plates after FISH with a probe for CL1SCL2Contig1080 - banana retrotransposon belonging to SIRE/Maximus lineage (green). Uneven genomic distribution with clusters dispersed on all chromosomes is obvious. (E) Similar genomic distribution was found for banana retroelement related to the Angela lineage (CL2Contig49). (F) Banana retrotransposon belonging to Tnt1 lineage (CL10Contig16, red color) gave weak signals preferentially localized in distal parts of chromosomes (long arrows). (G) The most abundant type of Ty3/gypsy-like element of the Reina lineage (CL1SCL5Contig891) localized preferentially to centromeric or peri-centromeric regions of all chromosomes (green signals). (H) Also the Ty3/gypsy-like element related to Tekay evolutionary lineage (CL4Contig82) clustered in centromeric or peri-centromeric regions of all chromosomes. (I) A probe derived from LINE element (CL1SCL8Contig452) localized in the centromeric regions of all chromosomes (green signals).

Ty3/gypsy-like retrotransposons showed a relatively low degree of phylogenetic diversity and most of them belonged to the lineage of chromoviruses. This single lineage comprised about 87% of Ty3/gypsy elements identified in this study, thus greatly outnumbering elements from the Tat lineage, which included all other Ty3/gypsy elements identified in the banana genome. The chromoviral sequences could be classified into four clades: Galadriel, Tekay, Reina and CRM [30-32]. The most abundant chromoviral clade was Reina, which included more than half of all chromoviral sequences, making up about 4% of the banana genome. Many elements belonging to this clade appeared to be non-autonomous as they lacked parts of the RT-coding domain (data not shown). Members of the Tekay clade were found to be the second most abundant group of chromoviruses, reaching about 2% of the genome. Sequences from the Galadriel clade corresponded to the retrotransposon monkey, which has been identified earlier in the banana genome [11]. The consensus sequence of the monkey retrotransposon, assembled from our 454 data as a 5880 bp fragment, showed 95% similarity to the monkey element described by Balint-Kurti et al. (2000) [11]. Previous estimates of the copy number using slot-blot analysis indicated that monkey constituted about 0.2 - 0.5% of the M. acuminata genome [11]; these estimates are in line with ours, which are based on the proportion of monkey-derived sequences in 454 reads (Additional file 1). Although monkey was supposed to be the most abundant repetitive element in banana [5], our data showed that several other families of retroelements account for much larger parts of the genome. The CRM clade sequences occupied a similar fraction of the genome as those from the Galadriel clade. Although they are members of the same evolutionary lineage, banana chromoviruses from distinct clades partly differed in their chromosomal distribution. Contrary to monkey, which preferentially localized in secondary constrictions [11], members of other clades occupied mostly pericentromeric regions and some additional loci in distal parts of all chromosomes (Figures 3G, H).

Non-LTR retrotransposons and DNA transposons

Compared to LTR-retrotransposons, non-LTR retrotransposons and DNA transposons were found to be relatively rare (Additional file 1). Within the clusters that represented at least 0.01% of the genome, only one cluster of LINE sequences [33] and two clusters of DNA transposons were identified. The LINE elements were estimated to constitute about 1% of the banana genome. FISH with a probe derived from the reverse transcriptase domain of a LINE-like element resulted in dot signals in centromeric regions on all chromosomes (Figure 3I). DNA transposons identified in this work included elements that showed similarity to transposons belonging to the hAT superfamily [34]. FISH with a probe derived from a hAT-related element failed to give visible signals, most probably due to a relatively small copy number. The low abundance of LINEs and DNA transposons seems to be typical for plant genomes, and similar abundances were observed, for example, in rice, grape and maize genomes [23, 24, 35].

45S and 5S rDNA

Clusters containing 45S rDNA represented 1.12% of the genome and the 45S rDNA sequence region was reconstructed as a 7,553 bp fragment that included complete sequence of the 18S-5.8S-26S rRNA locus surrounded by parts of intergenic spacer (IGS). Moreover, based on similarity searches to BAC clone MA4_01C21 from M. acuminata, which was sequenced within the context of the Global Musa Genomics Consortium [13] and which carries 45S rDNA units, another cluster containing IGS-like sequence was identified in the 454 data.

In contrast to Balint-Kurti et al. (2000), whose results obtained after FISH on mitotic chromosomes indicated insertion of a part of monkey into 45S rDNA [11], our 454 data suggest that monkey is not frequently associated with the 18S-5.8S-26S rRNA gene copies. A plausible explanation for this discrepancy is that the insertion is adjacent to the 45S rDNA locus. In fact, the spatial resolution of FISH on mitotic chromosomes is not sufficient to discriminate two loci closer than 5 - 10 Mbp [36, 37]. A close vicinity of the monkey fragment to 45S rDNA is supported by the sequence data of the BAC MA4_01C21 comprising 45S rDNA, in which a 1.5 kb fragment of monkey was identified. However, the fragment was not inserted in the 45S rDNA or IGS sequences. The fact that this BAC also comprises a chromovirus element from the Tekay lineage and an element from the SIRE1/Maximus lineage indicates that the BAC MA4_01C21 actually encompasses a border of the 45S rDNA locus and flanking genomic sequences, characterized by sequence heterogeneity and insertion of various mobile elements.

Similar to the 45S rRNA gene cluster, our 454 sequence data enabled reconstruction of the entire coding part of the 5S rRNA gene and its non-transcribed spacer. The 5S rDNA was found to represent 0.38% of the banana nuclear genome. Teo et al. [38] identified a Ty1/copia-like element in the 5S rDNA spacer in several banana species. The analysis of retrotransposon protein coding domains in our data confirmed that the 5S rDNA spacer contained a part of the reverse transcriptase of the Tnt1-like element.

Tandem organized repeats

Repeat reconstruction from the 454 data led to the discovery of two new tandemly organized repeats. One of them (CL33) consists of a ~130 bp monomer, while the CL18 repeat is characterized by a ~2 kb monomer unit. FISH on mitotic chromosomes revealed clusters of signals in the subtelomeric regions of one pair of chromosomes (satellite CL18) and weak signals in the telomeric region on two pairs of chromosomes (satellite CL33) (Figures 3A, B, C). Southern hybridization resulted in a ladder-like pattern typical for tandemly organized repetitive units only for repeat CL33 (not shown). The repeat CL18 gave a weak smear with a few visible bands, most likely due to partially dispersed distribution and/or poor conservation of the monomer length. A rather low copy number of CL33 and/or the long repetitive unit of satellite CL18 may explain why these repeats were not identified in previous studies [5, 6]. In general, the absence of more abundant tandem repeats in the banana genome may be related to its relatively small size. Satellite DNA is a typical component of subtelomeric and centromeric chromosome regions in various plant species, but it often also forms blocks of repeats in interstitial regions [15, 39-42]. The results of this study, as well as our earlier observations [5, 6], indicate that typical centromeric satellite DNA is absent in the banana genome, and that the centromeric regions are likely to be made of various types of retrotransposons.

Identification of DNA markers

Following the thorough characterization of banana repetitive DNA, we screened the 454 sequences for the presence of loci potentially suitable for use as DNA markers. We focused on identification of simple sequence repeats (SSRs) and sites of insertions of transposable elements (ISBP - Insertion Site Based Polymorphism) [43]. In total, 27,946 of 454 reads containing SSRs were identified with repeat units ranging from 2 to 10 bp. The most abundant motifs were dinucleotides TA and GA and trinucleotides GAA (Figure 4). More than 11,000 reads were identified to contain potential ISBP sites, most of them carrying insertions of retrotransposons into unknown low-copy sequences. Databases containing 454 reads carrying SSR sequences and potential ISBPs were established and made publicly available on our website [44].
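As an illustration of the SSR screen described above, the following Python sketch detects perfect microsatellites with repeat units of 2 to 10 bp using regular expressions. It is not the software used in this study, and the minimum of four repeats, the function names and the example sequence are assumptions made for the illustration. Motifs are reported as found and are not merged with their rotations or reverse complements (for example TA versus AT).

```python
import re


def is_primitive(unit):
    """Return True if the unit is not itself a repetition of a shorter motif."""
    return (unit + unit).find(unit, 1) == len(unit)


def find_ssrs(seq, min_unit=2, max_unit=10, min_repeats=4):
    """Find perfect SSRs; return sorted (start, end, unit, repeat_count) tuples.

    Coordinates are 0-based and half-open. The min_repeats threshold is an
    assumed value, not a parameter reported in the text.
    """
    seq = seq.upper()
    found = []
    for k in range(min_unit, max_unit + 1):
        # A k-mer followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units such as ATAT, which are reported under the shorter motif AT.
            if is_primitive(unit):
                found.append((m.start(), m.end(), unit, (m.end() - m.start()) // k))
    return sorted(found)


# Made-up read containing a (TA)6 and a (GAA)5 microsatellite.
read = "GGC" + "TA" * 6 + "CCGT" + "GAA" * 5 + "TTGC"
ssrs = find_ssrs(read)
for start, end, unit, count in ssrs:
    print(f"{unit} x {count} at {start}-{end}")
```

Dedicated tools additionally handle imperfect repeats and scoring, which this sketch deliberately leaves out.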

Figure 4

Major groups of microsatellite DNA sequences identified in banana 454 reads.

Repeat identification in sequenced DNA clones

As the next step in utilizing 454 data, we took advantage of the read clustering during repeat analysis and created databases of sequence reads sorted according to their repeat of origin. These databases were then utilized for similarity-based repeat detection and classification in genomic BAC clone sequences implemented at the PROFREP server [45]. The analysis was performed for 49 BAC clones from M. acuminata cv. 'Calcutta 4' and for 12 BAC clones from M. balbisiana cv. 'Pisang Klutug Wulung' [14]. The clones were sequenced as part of the Generation Challenge Programme supported project (GCP-2005-15) and were selected based on the presence of resistance gene homologs and other gene-like sequences. Indeed, out of the 49 M. acuminata BAC clones, only 9 clones were highly repetitive. Fifteen BAC clones contained a single copy sequence with a large repetitive region, and the remaining BAC clones comprised single copy sequence without any detectable repetitive DNA and/or carried a very short repetitive region (Figure 5). Out of the 12 BAC clones from M. balbisiana, two were single copy, while the remaining 10 BAC clones carried low copy sequences mixed with large repetitive regions. The repetitive profiles of all 62 BAC clones are available as supplementary data (Additional file 2).

Figure 5

Examples of repeat identification in M. acuminata BAC clones. Nucleotide sequences of BAC inserts are represented on X axis. The plots represent genomic copy numbers of individual insert regions calculated from numbers of similarity hits to 454 read databases. The plot colors correspond to different types of the repeats. (A) A 'low-copy' BAC clone showing absence of repeats along its entire length. (B) A BAC clone with relatively long stretches of repetitive DNA. (C) A highly repetitive BAC clone with various types of repeats.

Conclusions

This work represents a major advance in the analysis of the nuclear genome organization in banana, an important staple and cash crop. The application of low-depth 454 sequencing provided the largest amount of DNA sequence data available for banana to date, and enabled a detailed analysis of the repetitive components of its nuclear genome. All major types of DNA repeats were characterized and Musa DNA repeat databases were established. The analysis of genomic distribution of selected repeats provided new data on long-range molecular organization of banana chromosomes, and a large number of loci potentially useful as DNA markers were identified. The improved knowledge and resources generated in this study will be useful in annotating the banana genome sequence, in the analysis of the evolution of the Musa genome, and for the study of dynamics of DNA repeats over evolutionary time scale, as well as to isolate DNA markers for use in genetic diversity studies and in marker-assisted selection.

Methods

454 sequencing

In vitro rooted plants of M. acuminata cv. 'Calcutta 4' (ITC 0249) were obtained from the International Transit Centre (Bioversity International, Global Musa Genebank hosted by the Katholieke Universiteit, Leuven, Belgium) and grown in a greenhouse. DNA for sequencing was prepared from nuclei isolated from healthy young leaf tissues according to Zhang et al. (1995) [46]. Isolated nuclei were incubated with 40 mM EDTA, 0.2% SDS and 0.25 μg/μl proteinase K for 5 hours at 37°C, and DNA was purified by phenol/chloroform precipitation. The 454 sequencing was performed at the Arizona Genomics Institute (Tucson, USA) using a 454 Life Sciences/Roche FLX instrument. All sequence information generated in this study is available on our website [44] and was submitted to the National Center for Biotechnology Information short read archive under accession numbers SRR057410 and SRR057411.

Data analysis

Following removal of linker/primer contaminations and artificially duplicated reads, the remaining 477,699 reads (average length of 206 nucleotides) were used for repeat analysis. The analysis was performed as described by Macas et al. (2007) [15], employing TGICL [47] and a set of custom-made BioPerl scripts for similarity-based clustering and assembly of reads. The clustering parameters used by the tclust program (part of TGICL) were set to consider pairwise similarity of two reads significant if it involved an overlap of at least 150 nucleotides with 90% or better similarity, representing at least 55% and 70% of the length of the longer and shorter read, respectively (OVL = 150 PID = 90 LCOV = 55 SCOV = 70). The reads within individual clusters were assembled into contigs using TGICL run with the -O '-p 80 -o 40' parameters, specifying overlap percent identity and minimal length cutoff for the cap3 assembler. Repeat type identification was done using blastn and blastx [48] sequence-similarity searches of assembled contigs against GenBank, and by detection of conserved protein domains using RPS-BLAST [49]. Tandem repeats within contig sequences were identified using dotter [50]. The classification of LTR retrotransposons into distinct lineages and clades was done using phylogenetic analyses of their RT sequences [15]. Alignment of RT sequences was carried out with ClustalX [51] and the phylogenetic trees were calculated using the neighbour-joining method. The trees were drawn and edited using the FigTree program.
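The pairwise criteria given above (OVL, PID, LCOV, SCOV) can be expressed as a simple predicate, sketched here in Python together with a union-find grouping of reads. This is an illustrative re-implementation under stated assumptions and not the tclust code: it assumes that reads are grouped transitively (single linkage) whenever a pair passes the thresholds, and the Hit record, the function names and the example reads are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise similarity hit between two reads (hypothetical record layout)."""
    read_a: str
    read_b: str
    len_a: int       # length of read_a in nucleotides
    len_b: int       # length of read_b in nucleotides
    overlap: int     # length of the overlap in nucleotides
    identity: float  # percent identity within the overlap


def is_significant(hit, ovl=150, pid=90.0, lcov=55.0, scov=70.0):
    """Apply the OVL/PID/LCOV/SCOV thresholds quoted in the text."""
    longer = max(hit.len_a, hit.len_b)
    shorter = min(hit.len_a, hit.len_b)
    return (hit.overlap >= ovl
            and hit.identity >= pid
            and 100.0 * hit.overlap / longer >= lcov
            and 100.0 * hit.overlap / shorter >= scov)


def cluster_reads(read_ids, hits, **thresholds):
    """Group reads connected by significant hits (single linkage via union-find)."""
    parent = {r: r for r in read_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for h in hits:
        if is_significant(h, **thresholds):
            root_a, root_b = find(h.read_a), find(h.read_b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for r in read_ids:
        groups.setdefault(find(r), set()).add(r)
    return sorted(groups.values(), key=len, reverse=True)


# Made-up hits: r1-r2 and r2-r3 pass; r3-r4 fails PID; r4-r5 fails LCOV.
hits = [
    Hit("r1", "r2", 210, 200, 160, 95.0),
    Hit("r2", "r3", 200, 250, 155, 92.0),
    Hit("r3", "r4", 250, 220, 150, 85.0),
    Hit("r4", "r5", 220, 300, 150, 95.0),
]
clusters = cluster_reads(["r1", "r2", "r3", "r4", "r5"], hits)
print(clusters)
```

In this toy example r1 and r3 end up in the same cluster although they share no direct hit, which is the expected behaviour of transitive grouping and the reason why reads from one repeat family can be collected from partially overlapping fragments.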

Microsatellite sequences were identified using Tandem Repeats Finder [52] and TRAP [53] programs, while a BioPerl script was used to identify ISBP loci [54]. Identification and classification of repetitive sequences within BAC clones was done via PROFREP web server [45] utilizing repeat-specific databases of 454 reads prepared in this study. The server performs BLAST-based searches against databases of whole-genome or repeat-specific 454 reads and generates plots of similarity hits along the query sequence (number of hits is proportional to copy number of the query in the genome).
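The idea of plotting similarity hits along a query sequence can be sketched in Python as a per-base hit count. This is only an illustration of the principle and not the PROFREP implementation: it assumes that each hit is available as a 0-based, half-open interval on the query (for example parsed from tabular BLAST output), and the function name and the example intervals are hypothetical.

```python
def hit_profile(query_length, hits):
    """Count, for every query position, the number of hits covering it.

    hits is an iterable of (start, end) intervals, 0-based and half-open.
    A difference array keeps the cost linear in query length plus hit count.
    """
    diff = [0] * (query_length + 1)
    for start, end in hits:
        start = max(0, start)
        end = min(query_length, end)
        if start < end:
            diff[start] += 1  # coverage rises at the start of the hit
            diff[end] -= 1    # and falls just after its last base
    profile = []
    depth = 0
    for delta in diff[:query_length]:
        depth += delta
        profile.append(depth)
    return profile


# Made-up hits on a 10 bp query.
profile = hit_profile(10, [(0, 4), (2, 6), (2, 8)])
print(profile)
```

Regions with many hits correspond to repetitive segments of the query, while regions with few or no hits correspond to low-copy sequence; converting hit counts to copy numbers would additionally require a scaling by sequencing depth, which is not attempted here.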

Preparation of probes for cytogenetic mapping

Primers specific for tandem repeats (Additional file 3) were designed from sequence contigs that carry tandemly organized repetitive units. Labeled probes were prepared by PCR on M. acuminata 'Calcutta 4' genomic DNA with biotin- and digoxigenin-labeled nucleotides. The PCR premix contained 1× PCR buffer, 1 mM MgCl2, 0.2 mM dNTPs, 0.2 μM primers, 0.5 U Taq polymerase (Finnzymes) and 10 - 15 ng template DNA. The PCR was performed as follows: initial denaturation of 3 min at 94°C, followed by 30 cycles of 1 min at 94°C, 50 s at 57°C and 50 s at 72°C, and a final extension step of 5 min at 72°C.

Specific primers were also designed for reverse transcriptase (RT) domains of different retroelements (Additional file 3). In the first step, the RT domains were amplified using PCR with a mix containing 1× PCR buffer, 1.5 mM MgCl2, 0.2 mM dNTPs, 0.2 μM primers, 0.5 U Taq polymerase (Finnzymes) and 10 - 15 ng template DNA. PCR products were checked by gel electrophoresis, cleaned up using Agencourt Ampure paramagnetic beads (Beckman Coulter), cloned into a TOPO vector (Invitrogen) and transformed into electro-competent E. coli cells. Forty-eight recombinant clones for each retroelement were PCR amplified using M13 primers and separated by gel electrophoresis. Clones for each RT domain were then cleaned up using ExoSAP-IT (USB Corporation) and used for Sanger sequencing to verify the presence of specific RT domains in the clones. The PCR products were sequenced with the BigDye Terminator 3.1 Cycle Sequencing Kit (Applied Biosystems) on an ABI 3730xl DNA analyzer (Applied Biosystems). The nucleotide sequences were analyzed and edited using the Staden Package [55], and searched for similarity with the corresponding 454 contigs using BLAST [48]. Clones with the highest similarity to reconstructed contigs were PCR amplified with biotin- and digoxigenin-labeled nucleotides and used as probes for fluorescence in situ hybridization. Selected clones used as probes showed at least 98% similarity with the corresponding 454 sequence.

Fluorescence in situ hybridization (FISH)

FISH was done on mitotic metaphase spreads prepared from meristem root tip cells as described by Doleželová et al. (1998) [56]. The hybridization mixture consisted of 40% formamide, 10% dextran sulfate in 1 × SSC and 1 μg/ml labeled probe. The mixture was added onto slides and denatured at 80°C for 4 min. The hybridization was carried out at 37°C overnight. The sites of probe hybridization were detected using anti-digoxigenin-FITC (Roche Applied Science) and streptavidin-Cy3 (Vector Laboratories), and the chromosomes were counterstained with DAPI. The slides were examined with an Olympus AX70 fluorescence microscope (Olympus) and the images of DAPI, FITC and Cy3 fluorescence were acquired separately with a cooled high-resolution black and white CCD camera. The camera was interfaced to a PC running the MicroImage software (Olympus).