Background

Brassica oleracea (CC genome, 2n = 18) is one of the most important species in the family Brassicaceae, which also contains the model species Arabidopsis thaliana and a great number of nutrition-rich vegetables and oilseed crops, such as B. rapa (AA, 2n = 20), B. nigra (BB, 2n = 16), B. napus (AACC, 2n = 38), B. carinata (BBCC, 2n = 34) and B. juncea (AABB, 2n = 36) [1]. Brassica oleracea is a very morphologically diverse species that includes common heading cabbage (B. oleracea ssp. capitata L.), cauliflower (B. oleracea ssp. botrytis L.), broccoli (B. oleracea ssp. italica L.), kohlrabi (B. oleracea ssp. gongylodes L.), kale (B. oleracea ssp. medullosa Thell.), and Brussels sprouts (B. oleracea ssp. gemmifera DC) [2]. This intriguingly broad variation provides an excellent model for studying biological functionality and morphological evolution using the modern tools of molecular evolutionary biology and comparative genomics [3, 4].

The A. thaliana genome has undergone two whole genome duplication events (α and β) within the crucifer lineage and one more ancient genome triplication event (γ) shared with most dicots (asterids and rosids) [5]. The Brassica and Arabidopsis lineages diverged from a common ancestor about 20 million years ago (MYA) after the α events [6], and a whole genome triplication event occurred subsequently in the Brassica ancestor 13–17 MYA [7]. The two representative Brassica diploids, B. rapa and B. oleracea, separated from each other about 3.75 MYA [8]. The genetic system of Brassica species, particularly of those described by the "triangle of U" (the relationship between three diploids and three synthetic tetraploids) [1], provides an unprecedented opportunity to study inter-species hybridization, polyploidization, genome evolution and its role in plant speciation. The genome of B. rapa (A genome) has been sequenced and made available in the BRAD database [9]. Recently, we finished the genome assembly of B. oleracea (C genome) and submitted the data to NCBI. These primary genomic data will facilitate structural, functional, and evolutionary analyses of Brassica genomes, as well as those of other Brassicaceae.

Several public databases now hold B. oleracea genome sequence data, including Brassica Genome Gateway (http://brassica.bbsrc.ac.uk/), Brassica.info (http://www.brassica.info/resource/databases.php), and AAFC Comparative Genome Viewer (http://brassica.agr.gc.ca/navigation/viewer_e.shtml). These databases present only partial genomic data for B. oleracea, such as QTLs, ESTs and cloned genes. To better access, search, visualize, and understand the genome sequences, annotation, structure, and evolution of the B. oleracea genome, we developed a comprehensive web-based database, Bolbase (http://ocri-genomics.org/bolbase), which includes genome sequence data and comparative genomics information. This user-friendly database will serve as an infrastructure for researchers to study the molecular function of genes, comparative genomics, and evolution in closely related Brassicaceae species, and will promote advances in molecular breeding within Brassica (Figure 1).

Figure 1 Schematic illustration of the Bolbase sitemap.

Construction and content

The genome of B. oleracea capitata (line 02–12) was sequenced by next-generation sequencing technologies combined with 454 and Sanger sequencing. In total, a 540-Mb draft assembly, representing 85% of the estimated 630-Mb genome, was generated and submitted to NCBI. In Bolbase, we collected the complete sequence assembly, including nine pseudomolecular chromosomes, 1,848 scaffolds, and all genome components, comprising 45,758 predicted protein-coding genes, 13,382 transposable elements, and 3,581 non-coding RNAs. For each annotated genomic component, we supplied detailed annotations and cross-links to publicly available databases. Moreover, we provided a comprehensive analysis of synteny among B. oleracea, B. rapa, and A. thaliana, using B. rapa data from BRAD (http://brassicadb.org/brad/, v1.0) [9] and A. thaliana data from TAIR (http://www.arabidopsis.org, TAIR9) [10].

Genomic component

A total of 45,758 predicted genes with annotations were collected in Bolbase (Table 1). Putative genes were organized by several grouping types, such as gene families, orthologous groups, and tandem arrays, and their locations on pseudomolecular chromosomes and scaffolds were included in Bolbase. Each putative gene was annotated using public databases or web service sites to obtain a comprehensive functional overview (Figure 2). A total of 13,382 transposable elements in B. oleracea were deposited in Bolbase, comprising two major classes: retrotransposons (Class I transposons) and DNA transposons (Class II transposons). Additional categories, such as long terminal repeat retrotransposons (LTR-RTs), long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), Tc1-Mariner, hAT, Mutator, Pong, PIF-Harbinger, CACTA, Helitron, and miniature inverted repeat transposable elements (MITEs), were listed hierarchically. Information on the different superfamilies and families of LTR-RT elements was also provided. Bolbase compiled 3,581 non-coding RNAs identified by their conserved motifs and sequence similarities: 312 microRNAs (miRNAs), 517 ribosomal RNAs (rRNAs: 18S, 28S, 5.8S, and 5S), 1,434 small nuclear RNAs (snRNAs: C/D-box, H/ACA-box, and splicing), and 1,318 transfer RNAs (tRNAs).

Table 1 Comparison of predicted protein-coding genes in Brassica oleracea, Brassica rapa, and Arabidopsis thaliana

Figure 2 Annotation of predicted protein-coding genes in the Brassica oleracea genome. (A) Basic information; (B) protein sequence features; (C) gene clusters, including orthologous groups and tandem duplicated arrays; (D) syntenic analysis, including orthologous genes, syntenic regions and triplicated blocks in B. rapa and A. thaliana; (E) the orthologous genes of Bol007288 in A. thaliana (AT5G06860 and AT3G12090); (F) the orthologous genes of Bol007288 in B. rapa (Bra038699 and Bra000594).

Gene clusters

Clusters of genes with similar functions evolve through tandem, segmental, or whole genome duplication and are important for genome evolution and trait establishment. The gene cluster section in Bolbase is composed of gene families, orthologous groups, and tandem duplicated arrays. First, HMMER v3.0 was employed to detect gene family members using HMM profiles from the Pfam database [11, 12]. Second, OrthoMCL 2.0 was used to classify orthologous groups with an E-value ≤ 1e-05 and an inflation parameter of 1.5; all B. oleracea genes were divided into 21,509 orthologous groups [13]. Third, tandem duplicated genes were classified using the BLASTP program with an E-value cutoff ≤ 1e-20, allowing one unrelated gene within a tandem array. Approximately 1,825 tandem arrays with 2 to 12 genes each were detected and saved in Bolbase.
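The tandem-array rule described above (similar genes that are adjacent or separated by at most one unrelated gene) can be illustrated with a short sketch. This is not the pipeline used to build Bolbase; the function name, the input layout, and the gene identifiers are hypothetical, and the set of similar pairs is assumed to have been filtered beforehand by the BLASTP E-value cutoff.

```python
def find_tandem_arrays(gene_order, similar_pairs, max_spacers=1):
    """Group genes on one chromosome into tandem arrays.

    gene_order    -- gene IDs in chromosomal order (hypothetical input)
    similar_pairs -- set of frozenset({a, b}) pairs that passed the
                     BLASTP cutoff (assumed to be precomputed)
    max_spacers   -- unrelated genes tolerated between two array members
    """
    parent = list(range(len(gene_order)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, gene in enumerate(gene_order):
        # Compare each gene only with neighbours inside the allowed window.
        window_end = min(i + 2 + max_spacers, len(gene_order))
        for j in range(i + 1, window_end):
            if frozenset((gene, gene_order[j])) in similar_pairs:
                parent[find(j)] = find(i)

    groups = {}
    for i, gene in enumerate(gene_order):
        groups.setdefault(find(i), []).append(gene)
    # Only groups with at least two members count as arrays.
    return [members for members in groups.values() if len(members) >= 2]


# Hypothetical example: g1/g2 are adjacent, g2/g4 are separated by g3.
order = ["g1", "g2", "g3", "g4", "g5", "g6"]
pairs = {frozenset(p) for p in [("g1", "g2"), ("g2", "g4"), ("g1", "g6")]}
arrays = find_tandem_arrays(order, pairs)
```

In this toy input the g1/g6 pair is ignored because the two genes are too far apart, so only one array is reported.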

Syntenic regions

To better understand evolutionary history and species divergence, syntenic regions between A. thaliana and Brassica species were identified using the MCScanX software [14] and manual curation, and they can be visualized and queried in Bolbase (Figure 3). Orthologous gene pairs were first identified as reciprocal best hits in an all-against-all BLAST search between species, with an E-value cutoff ≤ 1e-10 [15]. Then, MCScanX was employed to identify syntenic regions, using the parameters e = 1e-20, u = 1, and s = 5, which required a minimum of five consecutive orthologous gene pairs in the collinear regions. In total, 558 syntenic regions, including 22,413 gene pairs, were classified between B. oleracea and A. thaliana, and 1,034 syntenic regions containing 24,422 gene pairs were defined between B. oleracea and B. rapa. These data can be freely accessed and visualized (Table 2, Additional file 1). Moreover, nonsynonymous (Ka) and synonymous (Ks) substitution rates of orthologous gene pairs were calculated and provided.
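The reciprocal-best-hit step can be sketched as follows. This is a minimal illustration rather than the authors' code: the tuple layout (query, subject, E-value, bit score), the function names, and the gene identifiers are assumptions, and the collinearity detection performed by MCScanX is not reproduced here.

```python
def best_hits(hits, max_evalue=1e-10):
    """Map each query to its highest-scoring subject passing the cutoff.

    hits -- iterable of (query, subject, evalue, bitscore) tuples
            (an assumed, simplified view of tabular BLAST output)
    """
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # discard hits weaker than the E-value cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-10):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    backward = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical hits between two species (IDs and scores are invented).
a_vs_b = [
    ("bol1", "at1", 1e-50, 200.0),
    ("bol1", "at2", 1e-30, 120.0),
    ("bol2", "at2", 1e-40, 150.0),
    ("bol3", "at3", 1e-5, 40.0),   # fails the E-value cutoff
]
b_vs_a = [
    ("at1", "bol1", 1e-50, 200.0),
    ("at2", "bol1", 1e-30, 120.0),
    ("at2", "bol2", 1e-40, 150.0),
    ("at3", "bol3", 1e-5, 40.0),
]
rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Pairs obtained this way would then serve as anchors for a collinearity tool, which keeps only regions with enough consecutive anchors (five in the analysis described above).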

Figure 3 Syntenic regions of Brassica oleracea chromosome C01 and the Arabidopsis thaliana genome. As an example, B. oleracea chromosome C01, which contains 55 syntenic regions, was compared to the genome of A. thaliana. The hyperlinks under 'Region' or 'Mapped Region' will visually present the syntenic relationship between the two genomes. The hyperlinks under 'Detail' will retrieve orthologous gene pairs in the syntenic regions and calculate their Ka/Ks values and divergence times.

Table 2 Syntenic regions on pseudomolecular chromosomes in Brassica oleracea, Brassica rapa, and Arabidopsis thaliana

Utility

Bolbase provides a user-friendly interface to facilitate the retrieval of information. Five main functional units (browse, synteny, search, document, and help) were integrated into Bolbase. From those units, users can browse genomic and comparative genomic information for B. oleracea and its relatives or retrieve comprehensive genomic component annotations, their locations on pseudomolecular chromosomes, and genome sequences. These genomic data can also be downloaded in bulk. Bolbase will therefore facilitate studies on genome variation and genomic structure differentiation within and between species. Here we describe some main functions of the interface.

Browsing genomic components and syntenic regions

The genomic component web interface of Bolbase is organized by component type. Each of the main navigation tabs focuses on a specific component to allow users to retrieve information from the database. This functional unit is contained in "Browse" on the main navigation bar. The putative gene tab is organized by gene families, orthologous groups, tandem arrays, and gene locations on pseudomolecular chromosomes or scaffolds. Repeat element and non-coding RNA tabs are organized by types, categories, or superfamilies. In particular, Bolbase provides detailed functional annotations for every putative gene, divided into four units: (i) basic information (Figure 2A); (ii) protein sequence features (Figure 2B); (iii) gene clusters, including orthologous groups and tandem duplicated arrays (Figure 2C); and (iv) syntenic analyses including orthologs in B. rapa and A. thaliana, as well as corresponding syntenic regions and triplicated blocks (Figure 2D). Basic information consists of gene identifier, location, model structure (intron/exon boundary, number, length, etc.), and coding nucleotide and protein/peptide sequences. The protein sequence features unit displays in detail the conserved protein domains or motifs predicted by InterProScan [16]. Additionally, putative genes were annotated by comparison with different databases, including Gene Ontology (GO) [17], Swiss-Prot [18], TrEMBL [18] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [19].

To show the collinear relationship between species, the syntenic regions in B. oleracea, B. rapa, and A. thaliana are drawn on chromosomal images produced by Perl scripts, and statistics for gene pairs between species are shown as scatter plots. The syntenic regions between any target chromosome and those of other species will appear when the chromosome is selected, revealing gene pairs in each region and their Ka, Ks and Ka/Ks values.
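As a rough guide to reading these values, the sketch below computes the Ka/Ks ratio and converts Ks into an approximate divergence time with the standard relation T = Ks / (2r). The substitution rate r = 1.5e-8 synonymous substitutions per site per year is an assumption (a value often used for Brassicaceae); the text does not state which rate, if any, Bolbase applies, and the function names are hypothetical.

```python
def ka_ks_ratio(ka, ks):
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    if ks <= 0:
        return None
    return ka / ks


def divergence_time_mya(ks, rate=1.5e-8):
    """Estimate divergence time in million years as Ks / (2 * rate).

    rate -- synonymous substitutions per site per year (assumed value,
            not taken from the Bolbase description)
    """
    years = ks / (2 * rate)
    return years / 1e6


# Hypothetical gene pair: Ka = 0.05, Ks = 0.30.
ratio = ka_ks_ratio(0.05, 0.30)
time_mya = divergence_time_mya(0.30)
```

A ratio well below 1 is usually interpreted as purifying selection, a ratio near 1 as neutral evolution, and a ratio above 1 as possible positive selection.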

Keyword search

The keyword search retrieves information such as sequences, annotations, and homologous genes. These functional units are contained in the "Search" section on the main navigation bar. This section mainly includes putative gene, transposable element, and non-coding RNA search pages. Putative gene searching will provide users with detailed annotations, orthologous genes, and/or tandem arrays, if they exist. By inputting a GO term, an InterPro entry, or a KEGG pathway entry, researchers can retrieve a group of putative genes in the B. oleracea genome. Different types, categories, and superfamilies of transposable elements can be screened in the transposable element search page. The non-coding RNA search page is designed to help users compile information on these genetic elements. The different types or categories of non-coding RNA can also be searched on this page.

Orthologous genes and syntenic regions search

Through comparative analyses among species, researchers can further understand the genomes of B. oleracea and its relatives. Orthologous genes in conserved syntenic regions can be displayed with a locally installed copy of GBrowse_syn by inputting a gene name, as indicated in Figure 3 [20, 21]. This functional unit is contained in the "Search" section on the main navigation bar. Here, we use the B. oleracea gene Bol007288 as an example to show orthologous genes in related species. By searching with Bol007288 as the query on the orthologous genes search page, two orthologs in A. thaliana (AT5G06860 and AT3G12090) and two in B. rapa (Bra038699 and Bra000594) are retrieved (Figure 2E,F). By selecting a chromosome from one species, syntenic regions in the other species can be visualized as a comparative chromosomal image, and lists of syntenic regions are displayed with their chromosomal positions. When the hyperlink for the target region is clicked, the syntenic regions in other species will be displayed.

Sequence similarity search

The similarity search page, which embeds customized BLAST software, will satisfy users with various interests related to homologous genes or regions. This functional unit is contained in the "Search" section on the main navigation bar. Users can supply a nucleic acid or amino acid sequence by uploading or directly pasting it to search against the available databases. Thus, this function allows quick comparisons and annotations of user query sequences using the data deposited in Bolbase. BLAST hits return with hyperlinks to the genes, enabling users to quickly acquire annotations from the database.

Discussion

Although a few Brassica databases existed previously, Bolbase is the first comprehensive database with a focus on the B. oleracea genome and comparisons with its relatives. The deposited sequences and annotations will allow users to retrieve and download information for both functional and comparative genomics studies. Compared to other databases of B. oleracea genomic data, Bolbase supplies more detailed genomic annotations from public databases, allowing users to analyze them more thoroughly. Syntenic regions and orthologous genes, which are useful resources for comparative and evolutionary analysis, can be explored visually. Additionally, the user-friendly interface gives users quick access to comprehensive information. The search tools allow multi-channel searching and will be improved in the future based on user feedback. We continue to update and expand the database by adding data from other Brassica species as they become available.

Conclusions

We have developed Bolbase, a comprehensive and searchable database of the B. oleracea genome. Bolbase is the primary resource platform for the B. oleracea genome and for genomic comparisons with its relatives, and its functions are not available in other public databases of Brassica species. To assist researchers and breeders in using the B. oleracea genomic information efficiently, Bolbase will be regularly updated with new genome annotations and the results of comparisons with newly sequenced genomes as they become available. We hope that Bolbase will provide a valuable resource for the study of the functional and evolutionary aspects of Brassica genomes and for further exploration of the evolutionary relationships within the Brassica genus and the crucifer lineage.

Availability and requirements

Database: Bolbase.

Database homepage: http://ocri-genomics.org/bolbase.

Operating system(s): Linux.

Programming language: Perl, Python, JavaScript.

Other requirements: Apache, PHP, MySQL, GD, SVG, GBrowse.

These data are freely available without restrictions for use by academics. Please log in to the 'Help' page on the Bolbase homepage or email Dr. Shengyi Liu (liusy@oilcrops.cn) to request data subsets of interest.