Introduction

Genome analysis in the model species Arabidopsis thaliana is a useful tool for comparative genomic studies in the related genus Brassica, which includes important crop species. Comparative mapping between the genomes of crop plants and their respective model species is becoming a common approach for identifying markers and candidate genes for mapping studies and for expediting positional gene cloning. Genome sequencing projects for B. rapa and B. oleracea are in progress, providing an opportunity to analyze the genome changes associated with the origin and evolution of these species in relation to A. thaliana (Ayele et al. 2005; Lim et al. 2006; Yang et al. 2006; Hong et al. 2006).

The genus Brassica includes three main cultivated species, B. nigra (n = 8), B. oleracea (n = 9) and B. rapa (n = 10), all of which function genetically as diploids. However, early evidence (Sikka 1940) indicated that these species are paleopolyploids, which is also the case for A. thaliana (Blanc et al. 2000). It is widely accepted that Brassica species and A. thaliana diverged from a common ancestor about 14.5–20.4 million years ago (Mya) (Yang et al. 1999). The current thinking is that the common progenitor of the Brassicaceae had a basic genome of n = 4 chromosomes, which underwent a whole genome duplication 24–40 Mya, producing a tetraploid species of 2n = 4x = 16 (Henry et al. 2006). The genome of this putative species was similar to the present genomes of A. lyrata and Capsella rubella (2n = 16, ~230 Mbp of DNA), from which the genomes of the Brassica species presumably derive (Schranz et al. 2006). A. thaliana evolved from this common ancestral species 4–5 Mya after it became diploidized and underwent general gene loss and chromosomal rearrangements, including fusions or fissions, resulting in the present genome of n = 5 and 157 Mb of DNA (Johnston et al. 2005; Henry et al. 2006). Based on samplings of less than 2% of the genome, either by molecular marker map construction (Lagercrantz 1998; O’Neill and Bancroft 2000; Park et al. 2005; Parkin et al. 2005; Rana et al. 2004; Schmidt et al. 2003) or by FISH (Lysak et al. 2005; Ziolkowski et al. 2006), it has been proposed that the Brassica diploid species also evolved from the 2n = 4x = 16 ancestral species after additional rounds of genome duplication, resulting in a hexaploid ancestor. This would explain in part the increase in DNA content from 230 Mbp to 529–696 Mbp (Johnston et al. 2005) reported for monogenomic cultivated Brassica species. However, it ignores the fact that ploidy changes are changes in chromosome number: the monogenomic Brassica species (n = 8, 9 and 10) already have chromosome numbers identical or similar to those of the putative ancestral species. It also ignores the role of transposable elements, which have been estimated to make up as much as 20% of the B. oleracea genome (Zhang and Wessler 2004). Little is known about the genomic structure of the Brassica diploid species. The two main cultivated species, B. oleracea and B. rapa, diverged 7.3 ± 4 Mya (Wroblewski et al. 2000), and synteny is highly conserved for at least half of the chromosomes (Parkin et al. 2005). In the present study, we analyze similar sequences that harbor major glucosinolate genes in B. oleracea, B. rapa and A. thaliana in an effort to provide additional clues on the structure and evolution of the Brassica genome.

Materials and methods

Brassica oleracea BAC sequencing

BAC clones B47M9, B67C16, B77C13, B59J16 and B16J1 originate from the B. oleracea var. italica (broccoli) doubled haploid ‘Early Big’ library (Gao et al. 2004). These clones were selected because they harbor major genes in the aliphatic glucosinolate pathway. Clones B47M9, B67C16, B77C13 and B16J1 were outsourced to 454 Life Sciences (Branford, CT) for pyrosequencing. B59J16 was sequenced at the CA&ES Genomics Facility (CGF) on the UCD campus following traditional techniques (Gao et al. 2004). Gaps were filled by a combination of primer walking and shotgun sequencing of sub-clones on both sides of the gaps. A summary of the assembly details is shown in Supplementary Table 1. The final error rate, estimated using CONSED, was less than 1 bp per 100 kb. These sequences were deposited in GenBank under the following accession numbers: B16J1 (EU579454), B67C16 (EU581950), B77C13 (EU579455), B47M9 (EU673963) and B59J16 (EU568372). For comparison, we added to the analysis the sequences of three other BACs of the same broccoli variety: B21H13 (Gao et al. 2004), B19N3 (Gao et al. 2005) and B21F5 (Gao et al. 2006). Additionally, we included in the analysis the B. oleracea contigs sequenced by Town et al. (2006), comparing them to their corresponding B. rapa sequences in the public domain: BAC clones KBrH015M19 (AC172876), KBrB077F22 (AC189466), KBrB063K02 (AC189420), KBrH1070K21 (DQ369749), KBrH093K03 (AC155347), KBrB021P11 (AC189261), KBrS005L11 (AC189638), KBrH077A05 (AC155343) and KBrB080C12 (AC189471). The same method was used to analyze the sequences of these BACs.

Sequence analysis and gene-prediction

The BAC sequences were analyzed for protein-coding genes with the following gene-prediction software for A. thaliana: GenScan (Burge and Karlin 1997) and TwinScan, which compares conserved regions in the DNA of both species (Flicek et al. 2003). The BAC sequences were aligned with their corresponding A. thaliana sequences with BLAST 2.2.9 (Altschul et al. 1997). The BAC sequences were also compared to Arabidopsis, Brassica and Oryza sativa ESTs, cDNAs and CDS using BLAST and FASTA against the NCBI, AGI and TIGR (www.tigr.org/tdb/e2k1/bog1/) databases to analyze gene conservation. The conserved regions were translated into protein, and tBLASTn searches against the GenBank protein database (05/01/2008) were used to adjust exon–intron boundaries. The transposable elements (TEs) in the sequences were predicted and located with the program “RepeatMasker” (A.F.A. Smith and P. Green, unpublished data, http://www.repeatmasker.org/) and with BLASTN and BLASTX searches against the GenBank database (05/01/2008) to find, by comparison, all types of reported transposable elements. The ‘bases masked’ number is the total number of base pairs in masked sequences, which include retroelements, DNA transposons, low-complexity DNA and simple repeats.
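As a rough illustration of how a ‘bases masked’ figure can be derived, the sketch below counts masked positions in a hard-masked sequence. It assumes that masked bases appear as N (or X) characters in the masked output; the function name `masked_fraction` and the toy sequence are hypothetical and are not part of the original analysis.

```python
def masked_fraction(sequence: str, mask_chars: str = "NX") -> float:
    """Return the fraction of positions in a sequence that are masked.

    Assumption: masked bases were replaced by one of `mask_chars`
    (hard masking). Genuine ambiguous bases would also be counted.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base in mask_chars)
    return masked / len(seq)


# Toy 20 bp sequence with 8 masked positions (illustrative only).
toy_sequence = "ACGTNNNNACGTACGTNNNN"
print(f"{masked_fraction(toy_sequence):.0%} of bases masked")
```

Applied to each BAC or contig, a count of this kind would give the masked base percentage reported per clone.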

Results

Annotation of the five B. oleracea BACs harboring GSL genes

The five sequenced B. oleracea BACs, B67C16, B47M9, B77C13, B59J16 and B16J1, were selected because they harbor the major aliphatic glucosinolate (GSL) genes BoGS-OH (At2g25450), BoCS-lyase (At2g20610), BoCYP79F1 (At1g16410), BoGSL-PROb (At1g18500) and BoS-GT (At1g24100), respectively. The annotation of the genes in these BACs is shown in Supplementary Tables 2, 3, 4, 5 and 6. The other three BAC clones, previously sequenced and analyzed, are B21H13 harboring BoGSL-ALKa and BoGSL-ALKb, B19N3 harboring BoGSL-ELONG and BoGSL-ELONG-L, and B21F5 harboring BoGSL-PRO (Gao et al. 2004, 2005, 2006).

Characteristics of B. oleracea and B. rapa sequences in relation to A. thaliana

A total of 2,946 kb of B. oleracea sequence (including all eight broccoli BAC clones and the contigs published by Town et al. 2006), 1,069 kb of B. rapa sequence and 2,607 kb of corresponding A. thaliana sequence were compared and analyzed. Most of the comparative data were generated between B. oleracea and A. thaliana, since corresponding B. rapa sequences were available for only half of the B. oleracea clones: B16J1, B21F5, B67C16 and B77C13. Table 1 shows a global summary of the features of these sequences for all three species. All the genes in the sequences of the three species were taken into account; however, only 48% of the B. oleracea and 71% of the B. rapa genes had counterparts in their corresponding A. thaliana sequences, due to breaks in synteny. BLAST searches of the B. oleracea and B. rapa databases for the absent genes indicated that at least 50% had homologs elsewhere in the genome (data not shown). The global gene comparison for the chromosomal segments revealed that, on average, B. rapa genes were significantly longer (3,218 bp) than those in B. oleracea (2,721 bp); the latter tended to be longer than those in A. thaliana (2,310 bp), but that difference was not statistically significant. These differences were associated mostly with intron size, which was significantly larger in B. rapa than in the other two species. Gene density was lowest for B. oleracea (one gene per 8.0 kb), followed by B. rapa (one per 5.4 kb) and A. thaliana (one per 4.5 kb). This parameter was associated with gene spacer size, which was larger in the Brassica species, B. oleracea in particular (Table 1). Taking into account only the genes conserving collinearity, gene size as well as exon size and number were the same for all three species. However, intron size differed between A. thaliana and the Brassica species, but not between B. oleracea and B. rapa.

Table 1 Features of eight B. oleracea and four B. rapa BAC clones

Annotation of the eight B. oleracea BAC sequences (745.8 kb) resulted in the construction of a total of 94 gene models (Table 2). These include the updated annotation of the previously reported BACs B19N3, B21F5 and B21H13 (Gao et al. 2004, 2005, 2006). The eight clones could be classified by gene density: B21H13 has the highest density, with 23 gene models in 101.5 kb, and B47M9 has the lowest, with eight gene models in 104.6 kb (Table 2). A total of 89 gene models were annotated in the 495.7 kb of sequence from four B. rapa BAC clones that could be partially aligned to the four B. oleracea BAC clones listed at the beginning of this section (Table 2; Fig. 1). The B. rapa BACs had higher gene density than the B. oleracea BACs, in agreement with the global sequence comparison summarized in Table 1. To get a better picture of the alignment of the corresponding sequences of A. thaliana and B. rapa, the BAC clones of the latter species were aligned to the physical map of A. thaliana (Fig. 1).
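The gene densities discussed above follow from dividing sequence length by the number of gene models. The sketch below recomputes this spacing from the counts and lengths quoted in the text; the helper name `kb_per_gene` and the dictionary layout are illustrative assumptions, and the values in Table 2 itself may have been rounded differently.

```python
def kb_per_gene(gene_models: int, length_kb: float) -> float:
    """Average sequence length (kb) per annotated gene model."""
    return length_kb / gene_models


# (gene models, sequence length in kb) as quoted in the text.
segments = {
    "B21H13": (23, 101.5),
    "B47M9": (8, 104.6),
    "eight B. oleracea BACs": (94, 745.8),
    "four B. rapa BACs": (89, 495.7),
}

for name, (gene_models, length_kb) in segments.items():
    spacing = kb_per_gene(gene_models, length_kb)
    print(f"{name}: one gene per {spacing:.1f} kb")
```

A smaller value means a denser segment, so B21H13 (roughly one gene per 4.4 kb) is about three times as gene-dense as B47M9 (roughly one per 13.1 kb).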

Table 2 Summary of features identified in eight B. oleracea BAC clones and four B. rapa BAC clones
Fig. 1 The comparison map of B. oleracea and B. rapa BAC clones with A. thaliana. Open right arrow: DNA transposons; filled right arrow: retroelements; open rectangle: gene fragments. Vertical lines indicate sequence contigs. The triangle by each gene model name indicates the coding strand of the gene

Annotation of B. oleracea contigs A, B, C, D and G, covering a total of 1,518 kb, resulted in the construction of 378 gene models (Table 3). Five B. rapa BAC clones could be partly aligned with these five B. oleracea contigs. A total of 60 gene models were annotated in the 574 kb of sequence of these corresponding B. rapa BAC clones (Table 3). Contrary to the trend found for the B. oleracea BAC clones, gene density was higher in the contigs for this species (one gene per 4.0 kb) than in A. thaliana (one per 4.6 kb) and B. rapa (one per 5.6 kb).

Table 3 Summary of features identified in seven B. oleracea contigs and five B. rapa BAC clones

All of these Brassica contigs have a high level of DNA sequence conservation with their counterparts in A. thaliana. One hundred and thirty-nine and 187 A. thaliana gene models were identified in the regions corresponding to the eight B. oleracea and four B. rapa BAC clones, respectively (Fig. 1). Two hundred and thirteen and 136 A. thaliana gene models were identified in the regions corresponding to the five B. oleracea contigs and the other five B. rapa BAC clones, respectively (Fig. 1).

DNA sequence conservation and collinearity between Brassica and Arabidopsis

In general, collinearity, in the sense of finding corresponding genes in the same order and orientation, was high among all three species. In the eight B. oleracea BACs, 88% of the genes (83 of 94) conserved order with the 139 A. thaliana genes in their corresponding regions (Table 2; Fig. 1). Sixty-nine percent of B. rapa genes (63 of 91) conserved order with the 187 A. thaliana genes in their corresponding regions (Table 2; Fig. 1). However, gene content often differed among the three species in corresponding segments, due to frequent interspersed gene absence in the Brassica species relative to Arabidopsis (Figs. 1, 2). Thus, when all the genes in the corresponding segments are considered in the comparison, collinearity drops significantly, to 60% between B. oleracea and A. thaliana and to 33% between B. rapa and A. thaliana (Table 2). These values were higher for the five B. oleracea contigs, in which 70% (144/213) of the genes conserved collinearity with A. thaliana, and for their corresponding B. rapa BAC sequences (52%; 73/136) (Table 3; Fig. 2).
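The two ways of expressing collinearity for the BAC comparisons can be sketched as simple ratios. The example below assumes that the first figure divides the collinear genes by the Brassica gene models and the second divides the same collinear genes by the A. thaliana genes in the corresponding region; this reading is an assumption about how the percentages were derived, and the helper `percent` and the dictionary keys are hypothetical.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Counts quoted in the text for the BAC comparisons.
bac_counts = {
    "B. oleracea": {"collinear": 83, "brassica": 94, "arabidopsis": 139},
    "B. rapa": {"collinear": 63, "brassica": 91, "arabidopsis": 187},
}

for species, n in bac_counts.items():
    of_brassica = percent(n["collinear"], n["brassica"])
    of_arabidopsis = percent(n["collinear"], n["arabidopsis"])
    print(
        f"{species}: {of_brassica:.1f}% of Brassica genes, "
        f"{of_arabidopsis:.1f}% of A. thaliana genes in the segment"
    )
```

Under this reading, the computed values land within about one percentage point of the figures quoted above; the drop reflects the larger A. thaliana denominator rather than any change in the number of collinear genes.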

Fig. 2 The comparison map of five B. oleracea contigs and five B. rapa BAC clones with A. thaliana

In a few instances, genes present in chromosomal segments of Brassica species were absent in the corresponding segments of A. thaliana. For example, comparing B. oleracea with A. thaliana, genes A09–A11 on B16J1, B09 on B19N3, C03 and C09 on B21F5, and H03, H05, H09 and H10 on B77C13 are absent in A. thaliana.

Often the genes showing the syntenic changes were flanked by unrelated partial genes (Fig. 1). In fewer instances, the changes were associated with chromosomal rearrangements, as is the case for genes B01–B03 in B19N3. In a few cases, tandem duplicates could be observed in B. oleracea, for example genes D06a and D06b, and H01a and H01b. A similar situation was observed when B. rapa and A. thaliana were compared. Clone KBrB063K02, corresponding to chromosome II of A. thaliana, had two inserted homologs from chromosomes V and III, respectively, and two from chromosomes IV and I, respectively (Fig. 1). Homologs corresponding to the contiguous genes At1g15690 and At1g15700 were segmentally duplicated in B. rapa, a few genes downstream of the original location, next to a retroelement.

We could compare B. oleracea and B. rapa for segments corresponding to four sets of BAC clones. Missing genes in either B. rapa or B. oleracea were the main cause of collinearity disruptions. For B. oleracea BAC B16J1, there was almost perfect collinearity with B. rapa, spanning the homologs of At1g24030–At1g24140, except for At1g24050, which is missing in B. oleracea. Four other homologs next to At1g24050, which include At1g24060–At1g24080, were missing in both Brassica species. As in A. thaliana, B. oleracea genes A09, A10 and A11, which are homologous to At2g32430, At1g70140 and At1g67020, respectively, were missing in the corresponding segments of B. rapa (Fig. 1). Collinearity for the segment corresponding to B21F5 was almost identical in both species, except for gene C03, corresponding to the homolog At2g13865, and a partial gene, which were present in B. oleracea and absent in both B. rapa and A. thaliana (Fig. 1). BAC clone B67C16 has higher collinearity with its corresponding sequence in A. thaliana than with that of B. rapa. Genes G03 and G04 were absent in B. rapa, whereas the homolog corresponding to gene At2g25460 was missing in B. oleracea, and the segment spanned by genes G05–G08 in B. oleracea was absent in B. rapa. At least four other segments present in A. thaliana were absent in B. rapa: At2g25540–At2g25570, At2g25580 and At2g25590, At2g25605 and At2g25610, and At2g25625–At2g25670 (Fig. 1). No B. oleracea sequence was available to tell whether these segments were also absent in the corresponding segment for this species. The number of absent genes in B. rapa for the segment corresponding to B. oleracea BAC clone B77C13 was very extensive: B. oleracea genes H01, H03, H05, H09 and H10 were absent in B. rapa (Fig. 1). Additionally, comparison of the corresponding segments of B. rapa and A. thaliana showed that two blocks of genes were missing in the former species, At1g16260–At1g16320 and At1g16340–At1g16370 (Fig. 1).

Transposable elements

We detected 95 TEs in the eight B. oleracea BACs, corresponding to a masked base percentage of 13%, whereas in B. rapa these numbers were much lower, 30 TEs and 6.4%, respectively (Table 4). The TE density in the B. oleracea BACs was 0.13 TEs per kb (95/746 kb), whereas in the corresponding segments of B. rapa it was 0.06 TEs per kb (30/495 kb).

Table 4 Transposable elements of eight B. oleracea and four B. rapa BAC clones

We also detected 243 TEs in the five B. oleracea contigs, corresponding to a masked base percentage of 18%, whereas in B. rapa these numbers were much lower, 34 and 13%, respectively (Table 5). The TE density in the B. oleracea contigs was 0.16 TEs per kb (243/1518 kb), whereas in the corresponding segments of B. rapa it was 0.12 TEs per kb (69/574 kb).
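The TE densities quoted for the BACs and contigs are simple ratios of TE count to sequence length. The sketch below reproduces them from the numerators and denominators given in parentheses in the text; the function name `te_per_kb` and the labels are illustrative assumptions.

```python
def te_per_kb(te_count: int, length_kb: float) -> float:
    """Number of transposable elements per kb of sequence."""
    return te_count / length_kb


# (TE count, sequence length in kb) taken from the ratios quoted in the text.
te_data = {
    "B. oleracea BACs": (95, 746),
    "B. rapa BACs": (30, 495),
    "B. oleracea contigs": (243, 1518),
    "B. rapa BACs matching the contigs": (69, 574),
}

for label, (te_count, length_kb) in te_data.items():
    print(f"{label}: {te_per_kb(te_count, length_kb):.2f} TEs per kb")
```

In both data sets the B. oleracea value is the larger one, which is the comparison the text relies on.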

Table 5 Transposable elements of five B. oleracea contigs and five B. rapa BAC clones

In the B. oleracea BACs, retroelements (class 1 TEs) account for 61% of the TEs and DNA transposons (class 2 TEs) for 39%. The opposite is true for B. rapa, where retroelements account for 34% and DNA transposons for 66%. For the B. oleracea contigs, we found a similar frequency of these types of elements: 65% class 1 TEs and 35% class 2 TEs. However, for the B. rapa segments corresponding to the B. oleracea contigs, the percentages of class 1 and class 2 TEs were nearly the same (51 and 49%).

In the total B. oleracea sequence analyzed, there are 0.7 TEs per gene, 0.15 TEs per kb of sequence and 218 masked bases per kb of sequence. In the B. rapa BACs, there are 0.5 TEs per gene, 0.09 TEs per kb of sequence and 96 masked bases per kb of sequence. The number and masked base percentage of class 2 TEs are greater than those of class 1 TEs in both Brassica species (Table 6). Sixteen percent of the B. oleracea sequence, 10% of the B. rapa sequence and 4% of the A. thaliana sequence correspond to TEs.

Table 6 Distribution of transposable elements in Brassica

LTR elements were the main type of retroelement in both Brassica species. Among the DNA transposons, the En-Spm type is predominant in B. oleracea and the hAT type in B. rapa.

In the eight B. oleracea BACs, 17 transposable elements (13 DNA transposons and four retroelements) were inserted into genes. Only one retroelement was inserted into a B. rapa gene, the ortholog of At1g16170 (Fig. 1).

The insertion of TEs was not frequently associated with chromosomal segments displaying breaks in synteny among species. Thirty-one percent of the TEs in the eight B. oleracea BACs and 49% of those in the contigs were inserted into regions maintaining collinearity with A. thaliana, whereas 23% (7/30) of the TEs in the four B. rapa clones were inserted into regions maintaining collinearity with A. thaliana (Table 2; Fig. 1).

Little conservation of transposable element insertions was observed among the three species. Only one SINE-type TE with the same sequence was found at the corresponding location in B. oleracea contig B and B. rapa BAC KBrH093K03. This TE is of the AtSB6 type and is 68 bp long.

Discussion

Differential gene content for corresponding segments in the three species

Most of the comparative genomics work done to date among Brassica species with reference to A. thaliana is based on physical and genetic mapping procedures (Rana et al. 2004; Parkin et al. 2005; Park et al. 2005). Comparative sequencing of specific chromosomal regions provides useful new information on gene density, synteny and conservation of gene collinearity along these segments (Gao et al. 2004, 2005, 2006; Yang et al. 2006; Town et al. 2006). In general, gene density for the chromosomal segments studied was highest for A. thaliana, followed by B. rapa and B. oleracea. The limited sequencing data of B. rapa available for this study did not allow us to ascertain orthology with B. oleracea BAC clones B47M9, B67C16, B77C13, B59J16 and B16J1. However, the conclusion that can be reached from our survey is that, for the chromosomal segments studied, gene density is lower in the Brassica species than in A. thaliana. In terms of genome expansion, the Brassica/Arabidopsis sequence length ratio for the B. oleracea sequence analyzed is 1.7, ranging from 0.86 to 3.2 (Tables 2, 3). However, gene density is not uniform across the genome, where regions of higher and lower gene density may co-exist, as is the case for the region covered by BAC clone B21H13. A similar situation is observed for B. rapa, although gene density in this case is higher than in B. oleracea (Tables 2, 3). Lower density in the Brassica species is associated with larger introns and spacers and with extensive gene rearrangement resulting in the absence of genes in chromosomal segments otherwise collinear with A. thaliana. The fact that approximately 50% of the genes absent in the compared segments can be accounted for in the Brassica databases (which are incomplete) indicates that most of these genes have not been lost and are located elsewhere in the genomes of B. rapa and B. oleracea. Due to these rearrangements, the breaks in synteny between Brassica and A. thaliana can be quite extensive, depending on the chromosomal segment compared. This is also true between the two Brassica species, which evolutionarily are considered to be in the same lineage (Warwick and Black 1991). This is in agreement with the results of Parkin et al. (2005), who estimated that 74 gross rearrangements took place between the A and C genome chromosomes of B. napus. Using RFLP markers, they were able to identify 21 conserved regions of A. thaliana duplicated and rearranged in the A and C genome chromosomes of B. napus. These conserved segments were on average 9 Mbp in length. Physical mapping studies by Park et al. (2005), comparing specific chromosomal segments for all three species, also report breaks in synteny, mostly due to gene absence. The earlier report of Kowalski et al. (1994) had already suggested that conservation of synteny and gene content between A. thaliana and Brassica was limited to specific segments or genomic islands. However, as additional comparative sequencing data accumulate, it is evident that these conserved islands are small. Furthermore, in spite of their close phylogenetic proximity, the genomes of B. oleracea and B. rapa have also undergone extensive structural changes resulting in segmental conservation of collinearity. Most of the changes observed are species specific, including gene duplications. Thus, these changes have taken place after the separation of the oleracea and rapa lineages.

Differential TE content in B. oleracea and B. rapa

For the segments analyzed, we found a lower frequency of TEs in B. rapa than in B. oleracea, which is in agreement with the smaller genome size of the former species. We estimated that for these segments, approximately 16% of the B. oleracea sequence and 10% of the B. rapa sequence consist of TEs, which is not far from the global TE estimate of 20% for B. oleracea by Zhang and Wessler (2004) and Katari et al. (2005). Of this 20%, approximately 14% corresponds to retroelements and 6% to DNA transposons. For B. rapa, based on 60 Mb of BAC end sequences, 12.3% of the sequences consist of TE sequences, of which 84% are retroelements and 11.4% are DNA transposons (Lim et al. 2006). Our estimate is also in agreement with this report. Considering that the TE content in A. thaliana is only 4% (Zhang and Wessler 2004), the accumulation of these elements has taken place after the separation of the Arabidopsis and Brassica lineages. Alix and Heslop-Harrison (2004) analysed the diversity of retroelements in diploid and allotetraploid Brassica species and found a distinct clustering of Copia-like retroelements, much more so in the C genome than in the A and B genomes. This result is in agreement with our observation that retroelements occupy more sequence in B. oleracea (83 bp per kb) than in B. rapa (27 bp per kb).

When comparing TE insertions between B. oleracea and B. rapa, we found little conservation of TEs. Only one insertion was shared between the two species. Therefore, it is evident that each species has followed its own path of TE acquisition, with the rate of accumulation of these elements being higher in B. oleracea than in B. rapa.

Based on previous reports and our work, TEs are ubiquitous in Brassica species and have an important role in genome evolution. Most likely, current estimates for these elements will increase as progress is made on sequence annotation. For example, Lim et al. (2007), after analyzing close to 88,000 BAC clones, found that retrotransposons are major components of centromeres and peri-centromeric regions in most Brassica species. One can estimate the contribution of these elements to the Brassica genomes as follows: B. oleracea has a DNA content of 696 Mbp, of which 20% (139 Mbp) are TEs. Assuming an even distribution of DNA among all nine chromosomes, each then has approximately 77 Mbp of DNA. Thus, the increase in DNA by TEs in B. oleracea is equivalent to adding roughly two chromosomes to its genome. If the ancestral Brassica lineage had 2n = 4x = 16 (Henry et al. 2006), we would have to add another eight chromosomes to produce a hexaploid (Lysak et al. 2005; Parkin et al. 2005; Ziolkowski et al. 2006; Yang et al. 2006), which presumably would then have 2n = 6x = 24. Therefore, a simpler explanation is to assume that the Brassica lineage diverged from the tetraploid ancestral lineage proposed by Henry et al. (2006) after insertion of TEs, segmental duplications and chromosomal rearrangements resulting from hybridization events. This is a more likely scenario to explain the observed regional triplication than invoking another round of polyploidization followed by massive chromosome loss to return to the existing chromosome numbers of 2n = 16–20, characteristic of the monogenomic Brassica species.
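The back-of-the-envelope estimate of the TE contribution to the B. oleracea genome can be retraced in a few lines. The sketch below uses only the figures stated in the text (696 Mbp, 20% TEs, nine chromosomes) and the same simplifying assumption of an even distribution of DNA among chromosomes; the variable names are illustrative.

```python
genome_mbp = 696.0   # B. oleracea DNA content quoted in the text
te_fraction = 0.20   # global TE estimate quoted in the text
chromosomes = 9      # haploid chromosome number of B. oleracea

# DNA attributable to TEs, assuming the 20% estimate holds genome-wide.
te_mbp = genome_mbp * te_fraction

# Average chromosome size, assuming DNA is spread evenly over chromosomes.
mbp_per_chromosome = genome_mbp / chromosomes

# TE-derived DNA expressed in units of average chromosomes.
chromosome_equivalents = te_mbp / mbp_per_chromosome

print(f"TE-derived DNA: {te_mbp:.0f} Mbp")
print(f"Average chromosome: {mbp_per_chromosome:.0f} Mbp")
print(f"Chromosome equivalents: {chromosome_equivalents:.1f}")
```

Because the genome size cancels out, the result is simply the TE fraction multiplied by the chromosome number (0.20 × 9 = 1.8), which is the basis for the "two chromosomes" approximation.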