Introduction

Next-generation sequencing (NGS) platforms rapidly sequence entire genomes in the form of millions of short DNA fragments in a cost-effective and high-throughput manner. NGS has been widely applied in plant and animal genomic de novo sequencing (Al-Dous et al. 2011; Li et al. 2010; Xu et al. 2011), resequencing (Ashelford et al. 2011; Lam et al. 2010; Xia et al. 2009), methylation sequencing (Cokus et al. 2008; Lister et al. 2008, 2009; Baranzini et al. 2010; Dowen et al. 2012) and transcriptome and small RNA sequencing (Buggs et al. 2012; Strickler et al. 2012) for a plethora of functional and comparative genomics analyses. In complex polyploid species, including many of the world’s most agronomically important crops, genomics may be hampered by poor discrimination between homeologous fragments during sequence assembly. However, advances in NGS technologies and the associated computational algorithms required to analyze massive, complex data sets present huge potential for complex genome analysis and development of improved genomics-based breeding strategies (Aversano et al. 2012).

The Brassica genus contains the most diverse collection of agronomically important plant species and is a relative of the model plant Arabidopsis thaliana, from which it diverged ~20 Mya. The six most agro-economically important Brassica species comprise three diploid species, Brassica rapa (AA, 2n = 20), Brassica oleracea (CC, 2n = 18) and Brassica nigra (BB, 2n = 16), and three allotetraploid species, Brassica juncea (AABB, 2n = 36), Brassica napus (AACC, 2n = 38) and Brassica carinata (BBCC, 2n = 34), which were formed through the hybridization of their diploid genome counterparts (U N 1935). Worldwide, approximately 12 % of edible vegetable oil is provided by B. napus, B. rapa, B. juncea and B. carinata, and the global production of Brassica crops has doubled in the last 15 years. Given that many Brassica crop species are polyploids, the well-studied relationships between the diploid progenitor species and their corresponding tetraploids, combined with relatively small genomes, make Brassica species highly useful models for investigating the genetic and evolutionary mechanisms of polyploidization (Edwards et al. 2013).

The genomes of diploid and tetraploid Brassica species (1.2 Gbp) are 3–5 times and 10 times larger than that of A. thaliana, respectively (Arumuganathan and Earle 1991). Genome duplication combined with gene loss and insertion (Town et al. 2006; Mun et al. 2009), chromosomal rearrangements, and rapid divergence of repeat sequences (Koo et al. 2011) are frequent effects of polyploidization and have complicated Brassica genome structure. As a result, genome sequencing and assembly of polyploid Brassica species have been challenging. Fortunately, with advances in technologies and software, dissecting polyploid genomes has become possible. The current trend towards the application of NGS platforms in numerous plant and animal species will profoundly affect genomic research, improving our understanding of the molecular and evolutionary mechanisms underlying variation in a species’ phenotype, development and responses to environmental stressors.

This article will compare various NGS platforms and discuss advances and trends in their use with emphasis on their application and challenges in Brassica research and breeding. In so doing, this review will highlight the impacts of NGS in Brassica research as a model system for understanding the molecular basis of polyploidy in important crop species.

Characteristics and comparative analysis of various generations of sequencing technologies

First-generation sequencing

First-generation sequencing is based on the chain termination method invented by Frederick Sanger (Sanger et al. 1977). Today, this method has been automated and commercialized using sequencing machines available through companies including Applied Biosystems (USA) and Beckman Coulter (USA). Sanger sequencing dominated the DNA sequencing industry for almost three decades, providing sequence reads between 450 and 1,000 bp in length at high accuracy (99.999 %). With this method, the human genome (International Human Genome Consortium 2004) and the genome of the model plant species A. thaliana were completed, followed by those of two rice subspecies, Oryza sativa L. ssp. japonica and Oryza sativa L. ssp. indica (Goff et al. 2002; Yu et al. 2002). More recently, the genomes of Brachypodium distachyon (The International Brachypodium Initiative 2010), Populus trichocarpa (poplar; Tuskan et al. 2006), Prunus persica (peach; http://www.rosaceae.org/peach/genome), Sorghum bicolor (sorghum; Paterson et al. 2009), and Glycine max (soybean; Schmutz et al. 2010) were sequenced (Table 1).

Table 1 Plant species that have been sequenced using sequencing technologies

For Brassica species, the first physical map of the A genome of B. rapa was constructed from 67,468 BAC clones, spanning 717 Mb in physical length, fingerprinted by Sanger sequencing (Mun et al. 2008). The sequence of B. rapa chromosome A3 (31.9 Mb) was also obtained using traditional Sanger sequencing methods, incorporating 348 overlapping BAC clones (Mun et al. 2010).

Although first-generation sequencing enabled initial whole-genome sequencing efforts, Sanger sequencing technology has significant limitations. Firstly, conventional Sanger sequencing is still relatively expensive, currently $0.50 per kilobase, and while this cost has declined over the last few years (Snowdon and Luy 2012), it would cost $5,000,000 to sequence a 1 Gb genome to 10× coverage. Secondly, Sanger technology is difficult to improve upon since it depends on capillary electrophoresis separation of fluorescently labeled fragments (Varshney et al. 2009). Thirdly, the throughput (~70 Kbp of data per run) is exceedingly low, making it time-intensive to obtain genetic information, particularly for large numbers of samples in parallel. Thus, Sanger sequencing technology continues to be beneficial and widely applied for low-throughput projects, for example, sequence validation of PCR products and BAC end sequences, but is not practical for mainstream genome and transcriptome sequencing, particularly of complex genomes. Such larger-scale studies require higher-throughput technologies. Consequently, sequencing technologies able to multiplex numerous samples in parallel to obtain vast quantities of genetic information were more recently developed, collectively called NGS.
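The cost and throughput figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that assumes a flat price of $0.50 per kilobase and ~70 Kbp per run, as stated in the text; the function names are hypothetical and introduced only for illustration.

```python
import math


def sanger_cost_usd(genome_size_bp, coverage, usd_per_kb):
    """Cost of sequencing a genome to a given depth at a flat per-kilobase price."""
    total_kb = genome_size_bp * coverage / 1_000
    return total_kb * usd_per_kb


def sanger_runs_needed(genome_size_bp, coverage, bp_per_run):
    """Number of runs needed to reach the target depth (rounded up)."""
    return math.ceil(genome_size_bp * coverage / bp_per_run)


# A 1 Gb genome at 10x coverage, priced at $0.50 per kilobase (figures from the text).
cost = sanger_cost_usd(genome_size_bp=1_000_000_000, coverage=10, usd_per_kb=0.5)
runs = sanger_runs_needed(genome_size_bp=1_000_000_000, coverage=10, bp_per_run=70_000)
print(f"${cost:,.0f} over {runs:,} runs")
```

Under these assumptions the same 10 Gb of raw sequence would also require well over one hundred thousand runs, which illustrates why throughput, and not only price, limits the use of Sanger sequencing for whole genomes.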

Next-generation sequencing

With advancements in molecular research and the increasing demand for large quantities of nucleotide sequences came the development of fast, high-throughput and cost-effective NGS technology. NGS technology has been widely applied in recent years and includes a few major products with different chemistries, including 454 sequencing (Roche Applied Science), Solexa (now Illumina) technology (Illumina Inc.), SOLiD (Applied Biosystems), and the Polonator G.007 (http://www.polonator.org/) (Table 2).

Table 2 Comparison of different sequencing technologies

The first NGS instrument on the market was the GS20, developed by Roche Applied Science (USA). Recently, the Roche 454 GS FLX+ system was commercialized, yielding up to 700 Mb of sequence per run comprising 1,000 bp sequence reads with an accuracy of 99.997 %. This system includes the software required for assembly and mapping of these sequence reads into contiguous sequence, as well as for amplicon variant analysis (Klein et al. 2011; Mundry et al. 2012). The Roche 454 GS has proven effective in Brassica genome sequencing (Wang et al. 2011b), and improves the accuracy of mapping homeologous fragments because the relatively long reads increase specificity. Nonetheless, the Roche 454 GS remains costly compared with other NGS technologies.

Solexa sequencing, now provided by Illumina (including the HiSeq and MiSeq instruments), can generate up to 200 bp per read and up to 600 Gbp of data per run. Error rates for this system are approximately 0.1 %, primarily resulting from substitution errors, rather than insertions or deletions, in the sequencing and detection process (Minoche et al. 2011). Illumina sequencing is cost-effective and highly suited to the sequencing of short Brassica target sequences such as transcriptome sequences and small RNAs. However, its application as a sole NGS technology for whole-genome sequencing is limited by its relatively short read lengths.

The ABI SOLiD platform (version 4) clonally amplifies template fragments by emulsion PCR and sequences them by ligation, using DNA ligase and fluorescently labeled oligonucleotides rather than a polymerase. Another feature is the use of two-base encoding, which interrogates each base twice to reduce the error rate. However, the speed of sequencing is relatively slow, and the read length is only around 85 bp (Edwards and Batley 2010). The Polonator G.007 is similar to the SOLiD system, and its technology platform is open for manipulation and improvement by the user. However, the current read length is less than 30 bp (Morey et al. 2013). While vast quantities of data can be obtained from these clonal amplification platforms, their read lengths are much shorter than those of other NGS platforms, reducing their applicability to sequence assembly and subsequent analysis in polyploid species, including the amphidiploid Brassicas.
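To illustrate why interrogating each base twice helps with error detection, the sketch below implements the commonly described color-space scheme, in which each color is the XOR of the 2-bit codes of two adjacent bases (A = 0, C = 1, G = 2, T = 3). This mapping and the function names are assumptions made for illustration and do not represent the vendor's software.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(seq):
    """Encode a base sequence as colors; each color covers two adjacent bases."""
    bits = [BASE_TO_BITS[base] for base in seq]
    return [left ^ right for left, right in zip(bits, bits[1:])]


def decode_colors(first_base, colors):
    """Decode colors back to bases, starting from a known first base."""
    current = BASE_TO_BITS[first_base]
    bases = [first_base]
    for color in colors:
        current ^= color
        bases.append(BITS_TO_BASE[current])
    return "".join(bases)


def changed_positions(colors_a, colors_b):
    """Indices at which two color strings of equal length differ."""
    return [i for i, (a, b) in enumerate(zip(colors_a, colors_b)) if a != b]


reference = encode_colors("ATGGC")   # [3, 1, 0, 3]
variant = encode_colors("ATAGC")     # a true G>A substitution at the third base
print(changed_positions(reference, variant))  # [1, 2]
```

In this scheme a genuine single-base substitution changes two adjacent colors, whereas an isolated measurement error changes only one, so isolated color changes can be flagged as likely errors.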

Third-generation sequencing

Single-molecule sequencing technology directly analyses light signals generated by cellular nucleic acids without a requirement for clonal amplification or ligation in template preparation. Deletions are the main cause of error using this sequencing technology. The HeliScope Single Molecule Sequencer is the first commercial product using this technology, which has been successfully applied to resequencing individual human genomes (Pushkarev et al. 2009). A method of single-molecule direct RNA sequencing without cDNA synthesis has also been established for transcriptome analysis in diagnostics (Ozsolak and Milos 2011); however, the read lengths (on average 30 bp) are relatively short.

Single-molecule real-time sequencing marketed by Pacific Biosciences in 2011 reads the sequence of DNA immobilized on a zero-mode waveguide (ZMW) reaction cell in real-time during polymerization (Eid et al. 2009). This yields approximately 3,000–20,000 bp read lengths, but is more prone to error than second-generation NGS technologies. Due to its high speed and long reads, this technique is highly promising for assembly of large, complicated genomes such as from Brassica species, provided that the error rates can be reduced.

Nanopore sequencing is a new technology that detects single molecules rapidly and directly as they pass through a nanoscale pore in a membrane, driven by an ion current. This can produce long read lengths of around 25 kb, or up to 5.4 kb in solid-state nanopores (Branton et al. 2008), and does not require pre-processing of templates, the use of polymerases or ligases, or biochemical tags. Furthermore, if applied globally and successfully, the cost of this technology is likely to be significantly reduced. The main challenge of this technology is the fast translocation of molecules through the protein nanopore, which must be controlled to obtain accurate base reads. Oxford Nanopore Technologies has claimed that the first cost-effective nanopore sequencer, which can sequence the human genome in 15 min, will come to market later this year.

Ion Torrent, which uses semiconductor sequencing technology, was launched by Life Technologies Corporation and includes two platforms: the Ion Personal Genome Machine (PGM) and the Ion Proton. The current PGM 318 platform can produce about 1 Gb of data in 2 h, with read lengths of up to 400 bp. It does not depend on chemiluminescence or optics. Continued advances in these third-generation technologies will undoubtedly occur rapidly, allowing such technologies to be widely applied.

The choice of sequencing technology depends on the downstream application and is determined by the quantity and quality of the data output (read length, error rate, Gbp output), sequencing cost and time per run. The length of reads is particularly critical for organisms such as Brassicas, where the abundance of homeologous sequences hinders accurate genome assembly. If the target sequences are short, e.g., small RNAs, many sequencing technologies could be selected. However, specificity is also a major problem for RNA sequencing, because non-specific short reads from different loci can map to a single locus and distort expression estimates.

Complete genome sequencing projects using NGS

NGS technologies have contributed to the sequencing of complex genomes in the last few years, paving the way for future genomic and genetic research and crop improvement. The combination of traditional and NGS methods currently provides a rapid and cost-effective means of sequencing important plant species. For example, the cucumber genome (Cucumis sativus) was sequenced in 2009 with traditional Sanger (3.9×) and Illumina Genome Analyser (GA) NGS reads (68.3×). In this way, a total of 243.5 Mb was assembled and 26,682 genes predicted (Huang et al. 2009). The 353.5 Mb watermelon (Citrullus lanatus) genome was obtained at 108.6× coverage using the Illumina GAII system, and annotated with 23,440 genes (Guo et al. 2012a). Similarly, the barley (Hordeum vulgare; haploid content of 4.98 Gb) (Mayer et al. 2012) and sweet orange (Citrus sinensis; 367 Mb generated) genomes were sequenced using Illumina GAIIx technology (Xu et al. 2012b). The Illumina GA, Roche 454 and Sanger platforms were used to generate 844 Mb of the complex autotetraploid potato genome using a homozygous doubled-monoploid potato (Xu et al. 2011). A diploid cotton genome was sequenced on the Illumina HiSeq2000 platform at 103.6× coverage (about 775.2 Mb) (Wang et al. 2012b); however, given that most cotton cultivars are tetraploid, novel experimental and computational methods need to be applied for the characterization of natural polyploids. Wheat has a huge and complex allohexaploid genome that includes numerous repetitive elements. To simplify genome sequencing and assembly in this species, isolated chromosome arms can be sequenced through NGS technologies, reducing the confounding effects of multiple homologous chromosomes (Berkman et al. 2012). Brenchley et al. (2012) generated 17 Gb of the Chinese Spring (CS43) Triticum aestivum (bread wheat) genome with 454 pyrosequencing (5×), containing approximately 95,000 genes.

High-coverage and high-quality reference genome sequences are the foundation of omics investigations in a species, especially a polyploid species. A multinational consortium for Brassica genome sequencing was initiated in 2003, with the initial aim of sequencing the diploid B. rapa A genome using a BAC tiling path and Sanger technology. Following the development of NGS technology, the B. rapa (Chinese cabbage) v1.1 genome, including 41,174 protein-coding genes, was released in 2011 by the international B. rapa Genome Sequencing Project Consortium (Wang et al. 2011b). The completion of the B. rapa genome provides new insights into genome evolution in Brassicas and, importantly, the first Brassica reference genome. The B. oleracea diploid C genome is the second Brassica genome to be sequenced, using a combined Illumina and Roche 454 sequencing approach, and is expected to be released later this year.

B. napus is an allotetraploid formed from hybridization of B. rapa and B. oleracea and thus possesses a large and complex genome (AACC, 2n = 38). Currently, genomic investigation of tetraploid B. napus faces two critical problems: (1) homeologous regions of high sequence identity between the A and C genomes, which interfere with sequence read mapping and thus with accurate assembly and correct assignment of homeologous A and C chromosomes; and (2) numerous repeat sequences, including simple repeat sequences, minisatellites, satellites and different categories of transposons, which are enriched in B. napus and also hinder mapping and accuracy of assembly. Hence, strategies optimized to study the genome of tetraploid B. napus are required, as follows (Fig. 1):

Fig. 1 Strategies for B. napus whole-genome sequencing and future research emphases on the B. napus genome. In the sequencing of B. napus, doubled haploid (DH) lines are necessary to avoid the interference of allelic variants with sequence assembly. Diploid progenitor genomes and a high-density genetic map are also critical for the accurate mapping of homeologous regions and genome rearrangements relative to reference sequences. A combination of BAC-pool sequencing and whole-genome short-read sequencing will be used to produce a reference sequence of B. napus. Expressed sequence tags or BAC sequences obtained by Sanger sequencing will be used to verify the reference genome. With the production of high-quality reference genomes for Brassica species, numerous future studies will need to be launched

  1. The use of doubled haploid (DH) B. napus lines, benefiting from advanced microspore culture techniques in many Brassica labs. In these lines, each locus is homozygous, avoiding the interference of allelic variants with sequence assembly. In non-DH heterozygous lines, it is difficult to discriminate the allelic variants within the A or C genomes from the homeologous variants. Recently, the RFAPtools pipeline was developed, which proceeds in three steps: first, a pseudo-reference sequence is assembled; second, single-nucleotide polymorphisms (SNPs) are discovered and genotyped; and finally, allelic SNPs are discriminated from homeologous loci. Combined with a double-digestion RADseq (ddRADseq) approach in a B. napus DH population, this pipeline successfully discriminated allelic variants from homeologous sequences (Chen et al. 2013).

  2. The use of the concomitant genome sequences of B. rapa and B. oleracea, as the two parental species of B. napus, as reference genomes for B. napus genome assembly, followed by careful correction for any large genome rearrangements and variations between these progenitors and B. napus. Indeed, this is currently being applied in the B. napus genome sequencing project (Snowdon and Luy 2012).

  3. The use of a high-density genetic map, for the accurate mapping of homeologous regions and genome rearrangements relative to reference sequences. High-density genetic maps enable the correct ordering of sequence contigs to accurately link physical and genetic maps. In Brassica, the availability of large mapping populations and high-throughput, genome-specific genotyping technologies makes this highly applicable. For example, Illumina’s GoldenGate and Infinium SNP genotyping platforms, which can detect from 384 to 60,000 genome-specific SNPs in parallel, are currently being developed and applied in Brassica species (Durstewitz et al. 2010).

  4. The use of a combination of BAC-pool sequencing and whole-genome short-read sequencing for accurate production of a B. napus reference genome. In this method, for example, a 100 kb insert BAC library of ~110,000 clones (10× coverage of B. napus) will be evenly and randomly divided into 1,100 pools (100 clones per pool). For each pool, three Illumina short-insert paired-end sequencing libraries will be constructed, sequenced to >50× depth and assembled. Supercontigs can be acquired by merging contigs from each pool using the overlap layout consensus (OLC) method. The redundancy in the assembly can be removed by self-to-self whole-genome alignment and sequencing depth information. Whole-genome Illumina libraries (200-bp to 20-kb inserts) will then be constructed at 100× coverage and used to aid assembly. Overall, BAC supercontigs can be assembled into scaffolds and then into a reference genome, with the help of a high-density genetic map and gap-filling. Finally, expressed sequence tags (ESTs) or Sanger-sequenced BACs will be used to verify the reference genome. If a high proportion of these sequences (e.g., >99 %) map successfully to the reference genome, the reference genome is of high quality.

Sequencing projects for B. napus are currently being conducted by 11 research institutes, with the intention of analyzing genetic diversity in oilseed rape. Sequencing of the other three cultivated Brassica species, B. nigra, B. juncea and B. carinata, is also in the pipeline (see http://canseq.ca/).

Applications of NGS in Brassica species

SNP discovery

SNP density varies between and within species as well as among genomic regions. In rice, SNP density averages one in 147 bp (Subbaiyan et al. 2012), while soybean (Choi et al. 2007) and A. thaliana (Atwell et al. 2010) average 1/438 and 1/500 bp, respectively. Previously, the application of SNPs was limited because of the high cost of development and detection. SNPs were generally discovered through PCR amplification and Sanger sequencing of genomic regions of interest, or using DNA chips, both of which were laborious and time-consuming. SNPs were also detected with computational tools based on existing EST databases. For instance, a SNP discovery pipeline in barley, wheat, rice and Brassica was developed through analysis of assembled EST sequences (Duran et al. 2009), whereby SNPs are identified by BLAST comparison or keyword search using AutoSNPdb (http://autosnpdb.appliedbioinformatics.com.au/).

With the advent of NGS technology, the discovery of large numbers of genome-wide SNPs is now highly achievable. Abundant markers can be discovered through amplicon sequencing, transcriptome sequencing, DNA-rich genome sequencing and whole-genome sequencing (Henry et al. 2012). Davey et al. (2011) reviewed genome-wide marker discovery using NGS in both model organisms and non-model species via reduced representation sequencing methods, including reduced representation libraries (RRLs), complexity reduction of polymorphic sequences (CRoPS) and restriction site-associated DNA sequencing (RAD-seq). The development, validation and application of SNPs have been reviewed systematically (Mammadov et al. 2012). Freely available software such as SGSAutoSNP (Lorenc et al. 2012) enables rapid, high-throughput, accurate SNP discovery that can be applied to any species with available NGS data. SNP prediction can be complicated by the error rate of NGS and by repetitive or highly homologous regions causing misassembly of short reads. However, the use of paired-end and large-insert NGS sequence reads in genome assembly, together with strict quality control parameters in software such as SGSAutoSNP (Lorenc et al. 2012), can help to minimize non-specific read mapping and false SNP predictions.

Given that the current public Brassica reference genome is limited to the B. rapa A genome, several SNP detection methods that reduce the size of the target sequences were developed, such as transcriptome sequencing, EST-based sequencing, RAD sequencing and sequence capture using oligonucleotide probes. Trick et al. (2009b) developed a robust method to discover SNPs through transcriptome analysis of the polyploid B. napus cultivars Tapidor and Ningyou7, using the Illumina platform. In this case, sequence reads were aligned to Brassica unigene reference sequences using MAQ software to discover SNPs. In total, 23,330–42,593 putative SNPs were detected, depending on the read depth required, and ~90 % of these were termed hemi-SNPs, which were homozygous in one line but heterozygous in the other line (Mammadov et al. 2012). The hemi-SNPs between these two lines could be used for genetic mapping. In addition, Hu et al. (2012b) discovered 655 putative SNP markers by 454 sequencing of ESTs of two B. napus cultivars, ZY036 and 51070. Similarly, Durstewitz et al. (2010) identified 604 SNPs from ESTs in B. napus (one SNP per 42 bp), which were then validated using the Illumina GoldenGate SNP genotyping system. However, the primary limitation of SNP discovery from transcriptome sequencing or EST-based sequencing is its restriction to coding regions, and thus its failure to detect diversity in non-coding regions.
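The distinction between simple SNPs and hemi-SNPs can be expressed as a small classification rule on the bases observed at one reference position in each cultivar. The sketch below is illustrative only: the category labels, the function name and the requirement that the single base be shared with the two-base line are assumptions, and it is not the procedure of Trick et al. (2009b).

```python
def classify_site(bases_line1, bases_line2):
    """Classify one aligned position from the sets of bases seen in two cultivars."""
    if bases_line1 == bases_line2:
        # Same base(s) in both lines, including a shared A/C-genome difference.
        return "not polymorphic"
    n1, n2 = len(bases_line1), len(bases_line2)
    if n1 == 1 and n2 == 1:
        # Each line shows a single, different base.
        return "simple SNP"
    single, double = sorted((bases_line1, bases_line2), key=len)
    if len(single) == 1 and len(double) == 2 and single <= double:
        # One line shows one base; the other shows that base plus a second one.
        return "hemi-SNP"
    return "other"


print(classify_site({"A"}, {"G"}))        # simple SNP
print(classify_site({"A"}, {"A", "G"}))   # hemi-SNP
```

In an allotetraploid, the apparently heterozygous state in one line usually reflects reads from both homeologous copies mapping to the same unigene, which is why hemi-SNPs dominate in such data.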

For genome-wide SNP discovery, RAD sequencing technology is a simple, alternative approach for detecting polymorphisms in complex crop genomes by reducing the complexity of the genome. More than 20,000 SNPs and 125 insertions and deletions were identified in about 113,000 RAD clusters of the B. napus genome sequenced via the Illumina GAIIx system. In the same study, 26 out of 31 SNPs (84 %) in 16 RAD clusters were validated by Sanger sequencing (Bus et al. 2012). This approach is simple and effective for genetic mapping, but is limited for genome-wide association studies (GWAS) due to the small number of markers (Mammadov et al. 2012).

Sequence capture is another technique that reduces the size of sequenced fragments and identifies homologies in the genome by rapidly tagging a targeted region for sequencing. It not only identifies meta-QTL regions associated with traits but also detects the variance of complex traits, such as developmental and flowering traits (Snowdon and Luy 2012). A total of 87 SNPs and 6 indels were identified based on existing genomic resources in six B. napus varieties (Westermeier et al. 2009). Picho et al. (2010) presented the application of sequence capture in the B. napus cultivars Aviso and Montego and identified about 7,000 SNPs, which were useful for QTL mapping and genetic association studies. However, this technology requires a reference sequence for the design of capture probes.

Abundant Illumina read sequences of B. napus have been obtained via such complexity reduction methods. Despite the combination of NGS technology with bioinformatics, SNP discovery in the allotetraploid B. napus is currently complicated by the presence of highly homeologous ancestral A and C genomes and the absence of a reference genome sequence. Polymorphisms are only useful as inter-cultivar SNPs, and must be distinguished from intra-cultivar SNPs between the homeologous A and C chromosome regions. SNPs derived from different alleles and from homeologues are usually mixed together by the above methods. Following the completion of the B. oleracea genome sequence and a de novo B. napus genome, comparison with the corresponding diploid species could be used to distinguish these two kinds of SNPs. In addition, Chen et al. (2013) have successfully developed a ddRADseq approach, together with the RFAPtools bioinformatics pipeline, to discriminate allelic SNPs from homeologous sequences in B. napus.

SNP arrays in Brassica

SNP arrays are the main method for large-scale SNP analysis and can simultaneously detect millions of SNPs in one reaction. At present, there are two main types of SNP array, commercialized by Affymetrix and Illumina. The GeneChip Rice 44K SNP Genotyping Array was developed by Affymetrix and used to identify rice varieties and their genetic diversity (McCouch et al. 2010). The 135k Brassica exon array, representing 135,201 genes, has been designed for whole-transcript profiling and mapping and for analysis of genome evolution and adaptation in the Brassicaceae family (Love et al. 2010). The Illumina 1M SNP array has the capacity to identify 1 million SNPs. Although the cost of NGS continues to decline, allowing genotyping-by-sequencing approaches to become more feasible, SNP microarray techniques retain some exclusive advantages. In particular, they can provide robust analysis techniques using large public reference datasets at reduced cost when performing replicated experiments. For Brassica, a public, high-density Illumina SNP array was released in 2012, combining the efforts of 16 academic and commercial partners with Illumina Inc. (Snowdon and Luy 2012). At present, more than 50,000 SNPs have been found to function well in the Brassica A or C genomes using this SNP panel. This SNP array will offer advantages for GWAS and high-throughput screening of germplasm pools. It is also effective for identifying unique genes or primarily expressed genes in the genome (Parkin et al. 2010). However, the SNPs in this array are not evenly distributed across the genome, which may lead to bias in downstream analyses. Therefore, future SNP arrays should be developed based on evenly distributed, genome-wide A genome-specific or C genome-specific loci. Nested association mapping (NAM) using a SNP array will aim to uncover the basis of yield and stress tolerance in B. napus (Edwards et al. 2013).

Genetic map construction and gene mapping

Traditional gene mapping methods for most quantitative traits are based on high-density genetic maps, which are constructed using large numbers of molecular markers. NGS followed by the identification of SNPs and of genetically or physically linked groups of SNPs (SNP haplotypes) is an ideal tool for high-density mapping. Li et al. (2009) constructed a B. rapa linkage map with EST-based SNP markers and identified genes associated with flowering time and leaf morphological traits. In addition, a B. napus SNP linkage map was constructed based on SNPs discovered by Illumina sequencing (Bancroft et al. 2011). An integrated genetic map containing 5,764 SNPs and 1,603 PCR markers in B. napus was made through SNP genotyping, producing a higher-density, more accurate map than those previously available (Delourme et al. 2013). These SNP maps are applicable to research on complex traits and are also critical to the assembly of scaffolds in whole-genome sequencing.

Sequencing mixed DNA pools from lines with extreme trait values in a population enables the development of novel molecular markers linked to genes of interest. This is more rapid than gene identification and cloning using conventional methods. Mapping-by-sequencing based on resequencing of bulked segregants using the SHOREmap software package (http://1001genomes.org/software/shoremap.html) can identify candidate genes, but it is usually limited by the requirement for a completed genome as a reference. Galvao et al. (2012) developed a synteny-based method to perform mapping-by-sequencing with few markers in species where whole-genome sequences were unavailable but transcriptome assemblies were available. This was shown to be effective for genetic mapping in A. thaliana and its distant relative B. rapa. Due to the complexity of the tetraploid Brassica genome, at present the diploid progenitor B. rapa genome can be used as a reference for candidate gene identification (Tollenaere et al. 2012). A total of 70 SNPs associated with rapeseed pod shatter resistance were discovered and a major QTL was found on chromosome A9 through the combination of NGS and bulked segregant analysis (BSA) (Hu et al. 2012a). High-density genetic maps constructed using NGS technology are effective for the identification of SNP markers linked to target traits and can narrow the confidence intervals of QTLs of interest to smaller regions.
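The idea behind sequencing bulked segregants can be sketched as a comparison of allele frequencies between the two extreme pools: markers linked to the trait show a large frequency difference, while unlinked markers stay near zero. The example below is a minimal illustration with hypothetical marker names, read counts and thresholds; it is not the algorithm implemented in SHOREmap.

```python
def allele_frequency(alt_reads, depth):
    """Fraction of reads in a pool that carry the alternative allele."""
    return alt_reads / depth


def candidate_markers(sites, min_depth=10, min_delta=0.8):
    """Return markers whose allele frequency differs strongly between the two bulks.

    Each site is (name, (alt_reads_high, depth_high), (alt_reads_low, depth_low)).
    """
    selected = []
    for name, (alt_high, depth_high), (alt_low, depth_low) in sites:
        if depth_high < min_depth or depth_low < min_depth:
            continue  # too few reads for a reliable frequency estimate
        delta = allele_frequency(alt_high, depth_high) - allele_frequency(alt_low, depth_low)
        if abs(delta) >= min_delta:
            selected.append(name)
    return selected


# Hypothetical read counts for three markers in the high and low bulks.
sites = [
    ("snp1", (28, 30), (2, 30)),   # strongly skewed between bulks
    ("snp2", (15, 30), (14, 30)),  # similar frequency in both bulks
    ("snp3", (5, 5), (0, 6)),      # skewed but below the depth threshold
]
print(candidate_markers(sites))  # ['snp1']
```

In practice the depth and frequency thresholds would need to be tuned to the pool size and sequencing depth, and candidate markers would be examined in genomic windows rather than one at a time.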

Association mapping

Traditional linkage mapping identifies the relationships between traits and linked markers following recent recombination events in biparental, structured populations. In contrast, association mapping, also known as linkage disequilibrium mapping, can be used to identify genes linked with natural variation in populations. Although association mapping can be hampered by confounding population structure, leading to false positives and false negatives due to spurious correlations (Zhao et al. 2007), QTL mapping by association analysis can be a valid approach for phenological, morphological and quality traits, e.g., in winter rapeseed (Honsdorf et al. 2010). Zhao et al. (2010) constructed a B. rapa core collection of 239 accessions for association mapping studies. The genetic loci for oil content identified by association mapping were also located within QTL intervals identified by linkage mapping in B. napus (Zou et al. 2010).

Candidate gene sequencing (CGS) and whole-genome scanning (WGS) of natural populations are the two main methods of association mapping. NGS offers abundant molecular markers, producing large quantities of genotyping data. In situations where no reference genomic sequence is available, WGS has been carried out by applying SNP markers to gene expression variation data generated by RNA sequencing (Stower 2012). Given that the polyploid nature of B. napus complicates the assembly of genome sequences, and no reference genome is currently available, associative transcriptomics was proposed as a method to link molecular markers with trait variation identified in B. napus (Harper et al. 2012). This study found that QTLs for the glucosinolate content of seeds were located in genomic regions showing presence–absence variation for the gene of interest in the population. This research offers a model pipeline for association genetics in species with complex genomes.

However, conventional association mapping has a reduced ability to detect minor-effect QTLs. Hence, nested association mapping (NAM) was proposed to address the problem of numerous minor-effect QTLs. A NAM population is composed of several recombinant inbred line (RIL) families that are derived from crosses of diverse inbred lines to a single reference inbred line. This combines the advantages of linkage analysis and association mapping, enabling analysis of recent recombination events from segregating progeny and historic recombination events from parental inbred lines. NAM can also produce high-resolution mapping with high allele richness for QTL detection, for example, for quantitative resistance traits (Poland et al. 2011) and the genetic composition of complex traits (Cook et al. 2011). The construction of NAM populations of Brassica is underway for genome-wide association analysis of complex traits (Cowling and Balazs 2010). The disadvantage of this method is that the construction of a NAM population is time-consuming: the hybrids between each of the diverse parental lines and the reference line must be self-fertilized for six generations.

Transcriptome analysis in Brassicas

Transcriptome sequencing is an alternative approach that reduces the amount of sequence to be analyzed while retaining almost the same gene information. In particular, NGS provides a new tool for transcriptome sequencing even where genomic sequence information is not available. The first technology used widely in transcriptome sequencing was Roche 454, due to its long sequence reads. Subsequently, Illumina and SOLiD technologies, with high throughput and relatively short read lengths, came to dominate transcriptome sequencing in Brassica (Table 3). Bancroft et al. (2011) sequenced the leaf transcriptome of B. napus as well as its progenitor species, B. rapa and B. oleracea, using Illumina NGS. A SNP linkage map comprising 23,037 markers in B. napus was constructed after the analysis of sequence variation in these species. Transcriptome sequencing with NGS can produce important information for gene discovery (Higgins et al. 2012), causal SNP discovery within genes, and the discovery of genomic structural loci, for instance, alternative splicing (AS) determinants. Detailed information about SNP discovery is given above.

Table 3 The main application of NGS in Brassica

Digital gene expression

In this system, the transcript levels of given genes are quantified in silico by sequence read profiling, also termed digital gene expression. This aims to determine the level of gene expression in particular biological processes, at particular developmental stages and following various treatments, based purely on NGS read abundance for a specific locus. Previously, DNA microarray technology was used for this, whereby the hybridization intensity determined the levels of gene expression. Trick et al. (2009a) developed a public Brassica microarray resource, using the assembly of about 800,000 EST sequences, to analyze gene expression in resynthesised B. napus lines and their parents. However, important limitations of this method are that (1) sequence information is required, (2) the cost is high, (3) the results can be too complex and inconsistent to interpret clearly and (4) the candidate genes are difficult to determine (Table 4). Serial analysis of gene expression (SAGE) (Obermeier et al. 2009) and massively parallel signature sequencing (MPSS) are two other traditional approaches for RNA sequencing. SAGE requires a large number of sequencing reactions at high cost, while MPSS requires large quantities of mRNA (2.5–5 μg) to perform transcriptome analysis. Digital gene expression profiling based on NGS is becoming increasingly widely used in Brassica transcriptomics. This technique can discriminate homeologous gene expression in polyploids by comparison with the reference unigene sequences from representative diploid genomes (Higgins et al. 2012). Yu et al. (2012a) analyzed gene expression in a drought model of Chinese cabbage using Illumina NGS technology and found 1,092 genes associated with the response to water deficit. In addition, over seven million ESTs were generated from four oilseeds including B. napus at four stages of development using 454 pyrosequencing, which can assist future functional and comparative genomic research (Troncoso-Ponce et al. 2011). These studies suggest that digital gene expression is of value in the large-scale analysis of gene expression.
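Quantifying expression from read abundance can be sketched as follows. Raw read counts per gene are normalized for gene length and sequencing depth, and a log2 fold change is then computed between two samples. RPKM (reads per kilobase of transcript per million mapped reads) is used here as one widely used normalization; it is an assumption of this sketch rather than the method of any study cited above, and the gene names, lengths, counts and pseudocount are hypothetical.

```python
from math import log2
from typing import Dict


def rpkm(read_counts: Dict[str, int],
         gene_lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Reads per kilobase of transcript per million mapped reads, per gene."""
    total_reads = sum(read_counts.values())
    if total_reads == 0:
        raise ValueError("sample contains no mapped reads")
    return {gene: count * 1e9 / (gene_lengths_bp[gene] * total_reads)
            for gene, count in read_counts.items()}


def log2_fold_change(treated: float, control: float,
                     pseudocount: float = 1.0) -> float:
    """log2 ratio of treated to control; the pseudocount avoids division by zero."""
    return log2((treated + pseudocount) / (control + pseudocount))


# Hypothetical gene lengths (bp) and mapped read counts for two samples.
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
control_counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
drought_counts = {"geneA": 2000, "geneB": 1500, "geneC": 6500}

control_expr = rpkm(control_counts, lengths)
drought_expr = rpkm(drought_counts, lengths)
fold_a = log2_fold_change(drought_expr["geneA"], control_expr["geneA"])

print(control_expr)
print(round(fold_a, 3))
```

In the control sample geneB has three times as many reads as geneA but the same normalized value, because it is three times longer. This is why raw counts are not compared directly between genes.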

Table 4 Advantages and disadvantages of four methods of gene expression analysis

Gene discovery

In cases where genome sequences of species exist, unigenes can be obtained by assembly of RNA sequence reads (Wang et al. 2010; Chen et al. 2011). The transcriptome of tumourous stem mustard (B. juncea var. tumida Tsen et Lee) was sequenced by Illumina short-read technology, identifying 146,265 unigenes, of which 1,042 significantly expressed genes were associated with stem swelling and development (Sun et al. 2012). Zhou et al. (2011b) identified 7,155 genes related to chloroplast development in B. oleracea, determined the role of regulatory genes by RNA sequencing with the Illumina Genome Analyzer II system, and discovered 1,600 up-regulated genes in light signaling pathways in green curd tissue. mRNA sequencing with NGS is not only a powerful tool for identifying the genetic basis of certain traits but will also help to accelerate gene expression and functional analysis. Numerous related genes have been discovered, but the major genes and their functions were not validated in most studies. Major genes can be located using the extensive QTL mapping results already available, and then validated by real-time quantitative PCR or by resequencing candidate gene PCR products.

Mutational sites can be identified directly by deep sequencing with NGS technologies. Short-read sequencing has been successfully applied in the identification of frame shift mutations in A. thaliana, with high specificity and high sensitivity, without prior gene information (Laitinen et al. 2010). Such mutant screens can be applied to the Brassica genome. Mutations in larger targeted regions or whole exomes of mutant populations were detected in B. napus using Illumina sequencing (Plant and Animal Genome meeting) (Sidebottom et al. 2012).

Alternative splicing (AS)

AS refers to the various ways in which the splicing of introns from a single eukaryotic pre-mRNA may result in several different mRNAs and protein products. AS is important for determining the complexity of genes, and appears to occur in about 33 % of all rice genes (Zhang et al. 2010), with the possibility of reaching 60 % in different tissues, ambient conditions or developmental stages (Syed et al. 2012). In Brassica, changes in gene characteristics and AS patterns are common after polyploidization, and AS changes contribute greatly to transcriptome shock, in which extensive changes in gene expression patterns occur (Zhou et al. 2011a). However, dynamic changes in AS across whole genomes under different conditions or stresses, and the varied roles of AS patterns in the evolution of plant species, are still unknown. Akhunov et al. (2013) discovered a high level of AS pattern divergence in homeologous genes of wheat. This indicates that the dynamic changes of AS during polyploidization and across different developmental stages, tissues or treatments are a promising research direction in Brassica species. With the advent of B. napus genome sequencing, AS events and the role of AS in polyploidization will be identified.

MicroRNA

MicroRNAs (miRNAs) are a class of small non-coding RNAs of about 18–30 nucleotides that regulate gene expression in plant development and in the response to environmental stress. miRNAs are widely expressed in plants and animals, even in mosses and fungi, because of their conserved functions in developmental processes. These small RNAs are produced from hairpin-shaped precursors (pri-miRNA) with the help of the endonuclease DCL1 (DICER-LIKE 1). In the past, the discovery of novel miRNAs depended on cloning and sequencing of individual miRNAs, which could not be distinguished from other non-coding RNAs, such as rRNAs or tRNAs. Emerging microarray technology can detect miRNA genotypes on a large scale but is limited in its capacity to detect novel miRNAs. NGS technology opens new opportunities for novel miRNA discovery and profiling, as well as the identification of miRNA targets. By analyzing small RNA profiles of the embryos of B. napus with different oil contents at different developmental stages, a total of 50 conserved miRNAs, 11 new miRNAs and some miRNA targets were identified (Zhao et al. 2012). Korbes et al. (2012) detected 59 B. napus miRNA families at different seed developmental stages using Illumina sequencing, of which 13 were novel miRNA families. The putative functions of miRNA target genes were associated with seed development and energy storage. In Chinese cabbage, 228 novel and 321 conserved miRNAs were found using Illumina NGS technology, which laid the foundation for further study of miRNA regulation mechanisms (Wang et al. 2012a). However, these studies focused only on the discovery of miRNAs, and the core miRNAs and their roles in B. napus have been less studied. Considerable work is still required to determine the function of miRNAs in the polyploidization of B. napus. In addition, long non-coding RNAs are a novel field and have not yet been reported in Brassica crops. They may become as prominent a research topic as small RNAs.
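A typical first step in small RNA profiling can be sketched as follows: reads are restricted to the 18–30 nucleotide size range mentioned above and identical sequences are collapsed into counts. The sketch assumes adapter-trimmed reads are already available as plain strings, and the toy sequences are arbitrary illustration data, not annotated Brassica miRNAs. Real pipelines add further steps, such as removing rRNA and tRNA fragments and predicting hairpin precursors.

```python
from collections import Counter
from typing import Iterable


def collapse_small_rnas(reads: Iterable[str],
                        min_len: int = 18,
                        max_len: int = 30) -> Counter:
    """Count identical reads whose length lies within [min_len, max_len]."""
    collapsed: Counter = Counter()
    for read in reads:
        sequence = read.strip().upper()
        if min_len <= len(sequence) <= max_len:
            collapsed[sequence] += 1
    return collapsed


def reads_per_million(count: int, total: int) -> float:
    """Scale a raw count to reads per million retained reads."""
    return count * 1_000_000 / total


# Toy adapter-trimmed reads: two abundant species plus reads that are
# too short or too long to be miRNA candidates.
seq_20nt = "TGACAGAAGAGAGTGAGCAC"
seq_21nt = "TTGACAGAAGATAGAGAGCAC"
reads = [seq_20nt] * 3 + [seq_21nt] * 2 + ["ACGT", "A" * 35]

collapsed = collapse_small_rnas(reads)
retained = sum(collapsed.values())
for sequence, count in collapsed.most_common():
    print(sequence, count, round(reads_per_million(count, retained)))
```

Collapsing to unique sequences keeps the abundance information needed for expression profiling while greatly reducing the number of sequences that must be aligned or folded.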

Notably, degradome sequencing is a vital method for the identification of miRNA-targeted mRNAs. Traditionally, target mRNA identification relies on computational prediction and subsequent experimental verification, which can not only lead to inaccurate results but is also time-consuming. Degradome sequencing technology acts directly on the 3′-end of mRNA fragments with a poly-A tail, which are complementary to an miRNA, to identify the target gene. Recently, Xu et al. (2012a) performed whole-mRNA degradome sequencing of B. napus and found 33 conserved and 19 new mRNA targets, providing the first platform for research on miRNA regulatory function in Brassica. This study used mixed mRNA samples from different tissues, and abundant targets were detected. However, degradome sequencing can only detect targets regulated by transcript cleavage, not those regulated by translational repression.

The differential expression of miRNAs at different developmental stages and under diverse treatments is a main application of small RNA sequencing. Cadmium-regulated miRNAs of B. napus were identified by Illumina sequencing technology, and novel targets were found to participate in the response to this heavy metal (Zhou et al. 2012). In B. rapa, heat-responsive miRNAs were also identified and found to play a key role in the heat stress response (Yu et al. 2012b; Wang et al. 2011a). Srivastava et al. (2012) showed via microarray profiling that miRNAs play an important role in the response to arsenic stress in B. juncea. Indeed, both genetic and epigenetic factors, including heritable DNA methylation profiles and corresponding small RNA activities, contribute to phenotypic traits. Therefore, a combination of genome sequencing, single-base methylation sequencing, transcriptome expression analysis and small RNA analysis will form the basis of complex trait research.

Marker-assisted selection (MAS) and genome selection in breeding

In crop improvement breeding, it is essential to understand the association between phenotypic variation of traits of interest and the underlying genetic variation at the DNA sequence level. Molecular markers can be used to accelerate the process of traditional breeding. In backcross breeding, selection for the genetic background is a critical step in deciding the number of backcrosses with the recurrent parent. MAS can accelerate this process by tracking target genes with linked markers to eliminate linkage drag. Although the idea of MAS was put forward many years ago, successful breeding outcomes are rare and its use is limited for most quantitative traits, because only a small number of markers associated with a phenotype are identified and the genotyping cost is relatively high. Another method, termed genomic selection, has been proposed to solve such problems in breeding; it estimates the breeding values of individual lines using high-density markers across the whole genome (Fig. 2). In a simulation by Heffner et al. (2009), the correlation coefficient between the true breeding value and the genomic estimated breeding value (GEBV) reached 0.85, which indicated that GEBV could be used to estimate the true breeding value. Genomic selection is an ideal tool for selecting lines of interest with markers alone, circumventing the need for costly and time-consuming phenotyping of breeding lines. The accuracy of genomic selection using ridge regression-best linear unbiased prediction was higher than that of conventional MAS via composite interval mapping (Guo et al. 2012b). Likewise in Brassica, genomic selection will accelerate breeding cycles with the aim of meeting the increasing demands of production. However, the lack of valuable markers linked to important agronomic traits has negatively impacted research into genomic selection in Brassica. In addition, it is a challenge to develop homeologous gene-specific markers because it is very difficult to select efficient loci that distinguish highly similar homeologous genes.
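The principle of genomic selection can be sketched in a few lines. Marker effects are estimated jointly by ridge regression on a phenotyped training population, and the GEBV of an unphenotyped candidate is the sum of its marker genotypes weighted by those effects. The sketch below is a simplified illustration under stated assumptions: genotypes are coded -1/0/1, phenotypes are already centered, the shrinkage parameter is a fixed hypothetical value (in ridge regression-BLUP it is derived from variance components), and the toy data are noiseless.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ridge_marker_effects(genotypes: Sequence[Sequence[float]],
                         phenotypes: Sequence[float],
                         lam: float) -> List[float]:
    """Marker effects b solving (X'X + lam * I) b = X'y."""
    k = len(genotypes[0])
    xtx = [[sum(row[i] * row[j] for row in genotypes) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(genotypes, phenotypes))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


def gebv(genotype: Sequence[float], effects: Sequence[float]) -> float:
    """Genomic estimated breeding value: sum of genotype times marker effect."""
    return sum(g * e for g, e in zip(genotype, effects))


# Toy training population: 6 lines x 3 markers, noiseless centered phenotypes
# generated from hypothetical true effects (2.0, -1.0, 0.5).
train_x = [[1, 0, -1], [0, 1, 1], [-1, -1, 0], [1, 1, 1], [0, -1, 1], [-1, 0, -1]]
train_y = [1.5, -0.5, -1.0, 1.5, 1.5, -2.5]

effects_unshrunk = ridge_marker_effects(train_x, train_y, lam=1e-8)
effects_shrunk = ridge_marker_effects(train_x, train_y, lam=1.0)

# Unphenotyped selection candidates ranked by GEBV.
candidates = [[1, -1, 1], [-1, 1, 0], [0, 0, 1]]
values = [gebv(c, effects_unshrunk) for c in candidates]
best = max(range(len(candidates)), key=lambda i: values[i])
print([round(e, 3) for e in effects_shrunk], best)
```

With a near-zero penalty the noiseless effects are recovered almost exactly, whereas a larger penalty shrinks all effects toward zero. That shrinkage is what allows many markers to be fitted simultaneously when markers outnumber phenotyped lines.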

Fig. 2 The application of NGS in Brassica crop breeding

Future research emphases for NGS in Brassica

With the production of high-quality reference genomes for Brassica species, the stage is set for numerous genomics studies. For Brassica crops, we have emphasized some fields that require further analysis (Fig. 1). These are:

  1. Detailed gene structure. The start and end of the promoter (including transcription factor binding sites, TFBS), untranslated regions (UTRs, at both the 5′ and 3′ ends), coding regions, introns, exons and even enhancers of each gene should be determined.

  2. Characteristics of gene families and duplicated genes. Classification, evolution and transcriptional analysis of gene families and duplicated or homologous/homeologous genes should be carried out.

  3. Spatial analysis of pseudogenes. The position, number and role of pseudogenes and their homologous/homeologous genes are worth investigating, since they may play a role in species polyploidization and possibly in the generation of small RNAs that regulate other genes (Guo et al. 2009).

  4. Genome resequencing analysis. Genome resequencing should be carried out for SNP and Indel discovery and for analysis of the position, copy number variation (CNV) and phylogeny of genes among different materials, in order to study species evolution or domestication. A haplotype map or genome variation map should also be produced.

  5. Boundaries and novel function of repeat sequences. The boundaries of different kinds of repeat sequences, especially transposon sequences, should be defined. Though many tools are used to address this problem, there are no suitable tools that can adjust the prediction strategies according to the specificity of individual species (Myrick and Gelbart 2007). For Brassica crops, which are enriched for repeat sequences, boundary definition and classification of repeats is a major problem. In addition, the novel functions of repeat sequences should be clarified. For example, can some transposons generate small RNAs that regulate other coding genes? Which transposons are active, or can be induced to become active when the environment changes? What roles do they play in maintaining species propagation?

  6. Alternative splicing analysis. Alternative splicing events, new genes and fusion genes should be identified and resolved by high-depth transcriptome sequencing.

  7. Function and origin of small RNAs. The function of numerous small RNAs needs to be determined. NGS has now predicted or identified a considerable number of miRNAs in Brassica species, but their functions, origins, generation mechanisms and variation/evolution remain unclear.

  8. Relationship and function of genome composition. What is the relationship between the different components of the genome and their functions? Are genes, transposons and small RNAs unevenly distributed (e.g., in gene clusters, transposon clusters and small RNA clusters)? Wei et al. (2013) found that nested-LTR retrotransposons were distributed in six Brassica BAC clones, and that these played an important role in the formation of centromeres. If so, how does evolution or domestication affect this phenomenon?

  9. Effect of evolution on genome. How do evolution, domestication or artificial selection shape genome structure and the distribution and structural variation of different components in the genome, and how do they alter gene function, transposons and small RNAs?

  10. Development of newer and user-friendly software. Although the cost of NGS continues to decline, it can still be prohibitive for the analysis of large collections of accessions. Meanwhile, vast amounts of data are generated by NGS, but errors still exist, and intensive, professional computational tools are needed to store and process these data. Currently, analysis tools are mastered by only a few trained professionals. This can limit the amount of useful sequence information obtained from large numbers of sequence reads. Newer, user-friendly software needs to be developed and applied in order to keep pace with the rapid development of sequencing technology.

Outlook

Improvements in third-generation sequencing technology are promising for directly acquiring long sequence reads to reduce the complexity and cost of genome assembly. It is worth highlighting that the study of any species can be individualized to reveal the role of gene expression through high-throughput sequencing of specific tissues and organs, or of individual plant responses to different treatments and environments. The Human Genome Project was completed in 2003, and considerable progress has since been made in applying NGS technology to disease diagnosis and gene functional analyses. In Brassicas, information gained from the completed genome of B. rapa, for instance molecular markers, can be transferred to other Brassica crops. Xu et al. (2010) constructed an integrated genetic map of the A genome in B. napus using SSR markers originating from sequenced B. rapa BACs. In the near future, the release of the B. oleracea and B. napus genomes will greatly accelerate the large-scale application of NGS in Brassica species. This has important implications for downstream functional genomic and epigenetic analyses of the control of important agronomic traits in Brassicas and other complex crop species.