Background

Tobacco (Nicotiana tabacum L., 2n = 4x = 48) is an important model system in plant biotechnology [1] because of its advantages over other plant species. It has a relatively short generation time and high protein content, and it is easily transformed genetically [2, 3]. For these reasons, tobacco has been widely used in studies on plant response to pathogens [4], pyridine alkaloid (e.g., nicotine) biosynthesis [5], cell cycle [6, 7], oxidative stress [8] and pollen tube development [9]. More importantly, tobacco is an attractive green bioreactor that has been shown to produce a wide range of therapeutic proteins, including antibodies [10–12], vaccines [13, 14] and immunomodulatory molecules such as cytokines [15, 16].

Despite the prospective applications of tobacco in pharmaceutical production, few cultivars with low nicotine and alkaloid contents exist. Breeding new cultivars suitable for pharmaceutical production is further complicated by the scarcity of publicly available genomic information. Genetic linkage mapping based on molecular markers permits the elucidation of genome structure and organization [17]. It provides critical information for quantitative trait locus (QTL) mapping and marker-assisted selection. For some economic plants, including potato (Solanum tuberosum), tomato (Solanum lycopersicum), eggplant (Solanum melongena), pepper (Capsicum species) and Petunia (Petunia hybrida), whole genome sequencing and genetic linkage maps have elucidated their genome structures and assisted the breeding of cultivars with molecular markers [18]. Therefore, a high density genome-based linkage map of the tetraploid tobacco will improve current genetic research tools in the search for new cultivars. Thus far, linkage maps for tobacco have been constructed using low-throughput molecular markers such as simple sequence repeats (SSRs), which resulted in low density linkage maps [19, 20].

Single nucleotide polymorphisms (SNPs), the most abundant type of DNA variation, are currently used as genetic markers because of their wide distribution in the genome [21]. Compared to genetic markers based on size discrimination or hybridization, SNPs directly interrogate sequence variation and have the potential to reduce genotyping errors [22]. SNP discovery is amenable to high-throughput next-generation sequencing (NGS) technologies, which produce DNA sequences at a rate several orders of magnitude faster than conventional sequencing methods [17].

According to unpublished data, the genome size of tobacco is approximately 4.5 Gb. A genome of this size poses considerable challenges for sequencing-based marker discovery. Reduced representation library sequencing is an efficient approach that has been used in many genome studies [23]. Restriction site associated DNA sequencing (RAD-seq) technology [24–26] facilitates genetic variant discovery by allowing orthologous sequences to be targeted in multiple individuals [27]. This method relies on sequencing of DNA regions flanking the restriction sites of specific restriction enzymes. In brief, DNA fragments from digestion with a chosen restriction enzyme are ligated with an adapter, which contains a molecular identifying sequence (MID) unique to each sample. The DNA sequences flanking each restriction site are sequenced via massively parallel Illumina sequencing technology [28]. RAD sequencing has been highly successful in re-identifying genomic regions controlling known phenotypes [29–31].

To generate a high density genome linkage map for tobacco, we developed 4138 SNP markers using the Illumina HiSeq 2500 high-throughput platform. The mapping population was generated by crossing two tobacco (N. tabacum L.) cultivars. The F1 progeny was back-crossed to one of the parents. A total of 193 progenies were generated and all individuals were used for linkage map construction. We conducted SNP detection both with and without a reference genome, the latter referred to as de novo identification of SNP by RAD-seq (DISR). We compared these two methods and constructed a genetic map of tobacco based on a backcross (BC1) population.

Results

RAD library preparation and sequencing

A total of 196 sampled individuals from three generations, comprising HD (Hong hua Da jin yuan), RBST (Resistance to Black Shank Tobacco), F1 (HD × RBST) and 193 BC1 progenies, were used in the construction of 10 libraries for RAD sequencing (Table 1). In summary, 2641 Gb of raw data containing 26.4 billion paired-end (2 × 100 bp) raw reads, approximately 2640 billion base pairs, were obtained. Detailed library information is provided in Additional file 1. We removed the following types of reads: (a) reads with >10 % unidentified nucleotides (N), (b) reads with >40 bases having Phred quality ≤7, and (c) putative PCR duplicates generated by PCR amplification in the library construction process (i.e., read 1 and read 2 of two paired-end reads that were completely identical). These reads were stringently filtered out, and the remaining reads were assigned to samples by their index sequences to obtain clean data for each sample (Fig. 1). In total, 2481 Gb of clean data containing 24.8 billion clean reads remained after filtering, with an average volume of 12.11 Gb per sample and an average sequencing depth of 2.7× (the unpublished tobacco genome size is approximately 4.5 Gb).

Table 1 Library information and data output
Fig. 1 Statistics of read number for each sample

SNP calling and genotyping

Two distinct protocols were used for SNP calling and genotyping: the first with a reference genome, and the second without a reference genome, which we refer to as DISR. In the first protocol, 24.8 billion clean reads were aligned to the reference sequences (unpublished data) using SOAPaligner [32] (Release 2.21, http://soap.genomics.org.cn/). The mapping results were processed with Samtools [33]. Variations were called using the Unified Genotyper (Version 3.1, Genome Analysis Tool Kit) [34]. Any nucleotide difference between reads and the reference genome was initially called as a variant. This produced a large output of 7,343,419 raw SNPs, which required further filtering. Three parameters generated by the Unified Genotyper (genotype coverage, genotype quality, and SNP quality) were used as criteria for filtering the variant output.

Using a maximum missing data (MMD) threshold of 45 % in the BC1 population for each locus, a total of 8664 SNPs (p < 0.01) were recovered. Although this criterion is much looser than those used in many other studies [31], the effective number of genotyped individuals per locus is still larger than 100 (at least 55 % of the 193 BC1 individuals), which is sufficient for linkage map construction. In total, 5286 markers (χ² < 15) were selected for genetic map construction using JoinMap 4.0 [35] (Table 2).
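The two per-locus filters described above can be illustrated with a short sketch. It is not the pipeline used in this study: the genotype coding ("A" for homozygous, "H" for heterozygous, None for a missing call), the function names, and the 1:1 expected segregation ratio for a backcross are assumptions made for illustration. Only the 45 % MMD threshold and the χ² < 15 cutoff are taken from the text.

```python
from typing import Optional, Sequence

# Hypothetical genotype coding: "A", "H", or None for a missing call.
Genotype = Optional[str]


def missing_fraction(calls: Sequence[Genotype]) -> float:
    """Fraction of individuals without a genotype call at one locus."""
    return sum(c is None for c in calls) / len(calls)


def chi_square_1to1(calls: Sequence[Genotype]) -> float:
    """Chi-square statistic against an assumed 1:1 (A:H) backcross ratio."""
    n_a = sum(c == "A" for c in calls)
    n_h = sum(c == "H" for c in calls)
    n = n_a + n_h
    if n == 0:
        return float("inf")  # no usable calls: treat as maximally distorted
    expected = n / 2
    return (n_a - expected) ** 2 / expected + (n_h - expected) ** 2 / expected


def keep_marker(calls: Sequence[Genotype],
                max_missing: float = 0.45,
                max_chi2: float = 15.0) -> bool:
    """Apply the missing-data filter first, then the distortion filter."""
    if not calls:
        return False
    return (missing_fraction(calls) <= max_missing
            and chi_square_1to1(calls) < max_chi2)
```

With 193 BC1 individuals, a locus passing the 45 % threshold retains at least 107 called genotypes, which is why the missing-data check is applied before the segregation test.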

Table 2 Statistics for SNPs based on the two different methods

In the second protocol (DISR), 181,770 raw SNPs were obtained after the clean reads were processed. Using the same MMD threshold as in the first protocol, a total of 7457 SNPs (p < 0.01) were recovered. In total, 3282 markers were then selected (by the χ² test) for construction of the genetic map in JoinMap 4.0 [35] (Table 2).

Linkage mapping

The first linkage map, based on SNP calling with the reference genome, was constructed from a total of 8664 SNPs (p < 0.01), which yielded 4138 markers mapped to 24 linkage groups (LGs) with a total length of 1944.74 cM. The LGs ranged from 33.58 to 129.176 cM in length. Six LGs contained over 220 marker loci. LG09, LG23 and LG24 were the shortest LGs, spanning 73.937–107.485 cM, respectively, and comprising 65 loci, whereas LG05 was the largest LG by marker number, spanning 60.73 cM and containing 494 loci with a marker density of 0.123 cM/locus. The marker densities ranged from 0.117 cM/locus in LG12 to 1.679 cM/locus in LG23, resulting in an average distance of 0.712 cM between markers for the entire map (Table 3; Fig. 2).
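As a small worked example of the marker density figures quoted above, the sketch below assumes that density is simply the LG length divided by its number of loci. This definition is an assumption, and the function name is hypothetical, but it reproduces the value reported for LG05.

```python
def marker_density(length_cm: float, n_loci: int) -> float:
    """Marker density in cM per locus (assumed: LG length / number of loci)."""
    if n_loci <= 0:
        raise ValueError("a linkage group needs at least one locus")
    return length_cm / n_loci


# LG05 of the reference-based map: 60.73 cM and 494 loci.
lg05_density = round(marker_density(60.73, 494), 3)
print(lg05_density)  # 0.123
```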

Table 3 Statistics of 24 linkage groups with the reference genome
Fig. 2 Linkage maps based on the reference genome. The map was constructed with a total of 8664 SNPs (p < 0.01), which yielded 4138 markers mapped to 24 linkage groups (LGs) with a total length of 1944.74 cM. LG lengths ranged from 33.58 to 129.176 cM. Six LGs contained over 220 marker loci; for these LGs Haldane’s map unit was used, whereas Kosambi’s map unit was used for the other LGs. LG09, LG23 and LG24 were the shortest LGs, spanning 73.937–107.485 cM, respectively, and comprising 65 loci, whereas LG05 was the largest LG by marker number, spanning 60.73 cM and containing 494 loci with a marker density of 0.123 cM/locus. The marker densities ranged from 0.117 cM/locus in LG12 to 1.679 cM/locus in LG23, resulting in an average distance of 0.712 cM between markers for the entire map.

The second linkage map, from DISR, was constructed with 7457 SNPs that gave 3282 markers. Of those, 2162 markers were successfully mapped to 24 LGs with a total length of 2700.9 cM. The LGs ranged from 58.1 to 238.4 cM in length, and only one LG contained over 220 marker loci. LG24 was the shortest LG, comprising only 13 loci, whereas LG01 was the largest LG by marker number, spanning 159.9 cM and containing 224 loci with a marker density of 0.7 cM/locus. The marker densities ranged from 0.5 cM/locus in LG02 to 5.6 cM/locus in LG24, resulting in an average distance of 1.8 cM between markers for the entire map (Table 4; Fig. 3).

Table 4 Statistics of 24 linkage groups without the reference genome (DISR)
Fig. 3 Linkage maps based on DISR. This map was constructed with 7457 SNPs that produced 3282 markers. Of those, 2162 markers were successfully mapped to 24 LGs with a total length of 2700.9 cM. The LGs ranged from 58.1 to 238.4 cM in length. LG24 was the shortest LG, comprising only 13 loci, whereas LG01 was the largest LG by marker number, spanning 159.9 cM and containing 224 loci with a marker density of 0.7 cM/locus (for this LG the map unit was determined by Haldane’s distance, whereas Kosambi’s distance was used for the other LGs). The marker densities ranged from 0.5 cM/locus in LG02 to 5.6 cM/locus in LG24, resulting in an average distance of 1.8 cM between markers for the entire map.

Comparison of the DISR and the reference genome methods

The two methods were compared by the proportion of markers shared between the genetic maps based on the reference genome and on DISR. The DISR consensus sequences were mapped back to the reference genome to locate the SNP loci. The markers from the DISR method were then compared with the markers generated by the reference genome method. Consistent markers were recorded and presented as a Venn diagram. In total, 677 overlapping markers were observed, constituting 30 % of the DISR map and 16 % of the map based on the reference genome. In addition, 1535 markers were specific to the DISR map and 3461 markers were specific to the map based on the reference genome (Fig. 4).
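Once the DISR loci have been placed on the reference sequences, the comparison reduces to set operations on marker positions. The sketch below is illustrative only: keying a marker by a (sequence name, position) pair and the function name are assumptions, and the toy data are not taken from this study.

```python
from typing import Dict, Set, Tuple

# Hypothetical marker key: (reference sequence name, 1-based position).
Marker = Tuple[str, int]


def compare_maps(disr: Set[Marker], ref: Set[Marker]) -> Dict[str, float]:
    """Count shared and map-specific markers, as summarized in a Venn diagram."""
    shared = disr & ref
    return {
        "shared": len(shared),
        "disr_only": len(disr - ref),
        "ref_only": len(ref - disr),
        "shared_pct_of_disr": 100 * len(shared) / len(disr) if disr else 0.0,
        "shared_pct_of_ref": 100 * len(shared) / len(ref) if ref else 0.0,
    }


# Toy example with made-up positions.
disr_markers = {("s1", 10), ("s1", 20), ("s2", 5)}
ref_markers = {("s1", 20), ("s2", 5), ("s3", 7), ("s3", 9)}
summary = compare_maps(disr_markers, ref_markers)
```

Exact position matching is the strictest possible criterion; a real comparison might tolerate small positional offsets, which this sketch does not attempt.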

Fig. 4 Comparison of the two map versions. In total, 677 overlapping markers were observed, constituting 30 % of the DISR map and 16 % of the map based on the reference genome. In addition, 1535 markers were specific to the DISR map and 3461 markers were specific to the map based on the reference genome.

Discussion

Although tobacco has proven to be an attractive green bioreactor for the production of therapeutic proteins, the paucity of cultivars with low nicotine and alkaloid contents has blocked its movement from bench to field scale. A high density genetic map can provide sufficient information to accelerate genome-based breeding. Previous genetic linkage maps for tobacco were constructed using molecular marker based techniques, including restriction fragment length polymorphism (RFLP) [36], conserved ortholog sequences (COS) [37] and simple sequence repeat (SSR) markers [19, 20]. The best of these three linkage maps, the SSR linkage map, comprises 2318 SSR markers mapping to 2363 loci in 24 clearly defined LGs with a total length of 3270 cM [19] (Table 5). In comparison, our technique generated 4138 SNP markers for tobacco that defined 24 LGs with a total coverage of 1944.7 cM. This result is not only an improvement over previous reports, but also a confirmation that SNPs provide excellent marker density for linkage mapping and genomic selection [38]. To our knowledge, the tobacco linkage maps from this study, particularly the map generated with a reference genome, provide the highest number of markers among all available population-specific linkage maps.

Table 5 Comparison of linkage maps for tobacco

The Mendelian basis of quantitative traits provides a genetic framework for the dissection of polygenic traits [39] and can pave the way for the identification of candidate loci controlling the inheritance of complex traits. NGS technology makes it possible to achieve dense SNP marker coverage of genomes without the need for a reference sequence [24, 26]. An example of this is restriction site associated DNA sequencing (RAD-seq), which was originally developed as a tool for genetic mapping in fish and fungi [29] and later extended to many other species, including plants (Lolium perenne L., Momordica charantia, Corchorus olitorius L.) [25, 30, 40, 41]. In this study, a separate linkage map was also obtained via the DISR method, which did not need a reference genome. The DISR linkage map contains 2162 markers with a total coverage of 2700.9 cM and an average distance of 1.8 cM between markers. Together, these two high density linkage maps (Table 5) are compelling tools for gene and QTL mapping and marker-assisted breeding [42].

A comparison of the two maps showed an overlap of 677 markers (Fig. 4). We compared the ratios of overlaps between the two protocols and found that SNP calling with a reference genome was more efficient than SNP calling without one. In the DISR method, only one end of each read pair is used for SNP calling, whereas SNP calling with a reference genome uses whole genome information. Nevertheless, a reference-free approach of this kind is often required in practice, particularly in building linkage maps for species that do not have a complete genome sequence database. Moreover, an integration of the two protocols could result in a higher density map and thus assist in the breeding of other cultivars with low nicotine and alkaloid contents.

Conclusions

Using next generation RAD sequencing technology with two distinct SNP discovery methods, we mapped 2162 SNPs (DISR) and 4138 SNPs (reference genome) in tobacco. This study provides an example of high density linkage map construction, irrespective of reference genome sequence availability, and provides saturated information for downstream genetic investigations such as QTL analyses or genomic selection (e.g., for bioreactor-suitable cultivars).

Methods

Mapping population

Two tobacco varieties, Hong hua Da jin yuan (HD) and Resistance to Black Shank Tobacco (RBST), were used to develop the BC1 inbred population. HD is a high leaf mass cultivar from southwest China. RBST has high resistance to tobacco black shank disease. The BC1 inbred population was generated through a (HD × RBST) × HD cross in a breeding unit in Yuxi, Yunnan Province.

RAD library preparation and sequencing

Fresh young leaves were collected from HD, RBST, F1 (HD × RBST) and 193 individuals of the BC1 (F1 × HD) population. Leaf samples were snap frozen in liquid nitrogen and stored at −80 °C. Genomic DNA isolation and purification were conducted using a DNA extraction kit (Qiagen). DNA quality was analyzed on a 1 % agarose gel. The concentration of extracted DNA was determined with a spectrophotometer. Approximately 15 μg of purified DNA was processed to obtain 10 RAD libraries, each including about 20 individuals, following the protocol of Baird et al. [29] and the instructions of the reagent manufacturers. Genomic DNA from individual samples was digested with EcoRI (New England Biolabs). Individual-specific barcodes were ligated with an adaptor by T4 DNA ligase for sample multiplexing. Ligated DNA samples were pooled and sheared, and subsequently electrophoresed on a 1.5 % agarose gel to isolate DNA fragments of 300–700 bp. The Quick Blunting Kit (New England Biolabs) was used to generate phosphorylated blunt ends. Klenow Fragment (3′ → 5′ exo-; New England Biolabs) was used to add adenosine to the 3′ end. An adapter with divergent ends (P2 adapter) was ligated to enable selective PCR. The samples were PCR-amplified and the libraries purified with a MinElute column (Qiagen) to obtain approximately 100 μl (>50 ng μl−1) of sequencing libraries. The obtained RAD libraries were sequenced on an Illumina HiSeq 2500 as 100 bp paired-end reads.
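Because about 20 barcoded individuals share each library, reads must be assigned back to samples before any SNP calling. The sketch below shows one simple way this could be done and is not the procedure used in this study. The barcode sequences, the sample names and the function name are hypothetical, barcodes are matched exactly with no error correction, and the retained restriction-site remnant "AATTC" after an EcoRI cut is an assumption exposed as a parameter.

```python
from typing import Dict, Iterable, List


def demultiplex(reads: Iterable[str],
                barcodes: Dict[str, str],
                site_remnant: str = "AATTC") -> Dict[str, List[str]]:
    """Assign read-1 sequences to samples by their leading barcode.

    barcodes maps barcode sequence -> sample name (all barcodes equal length).
    Reads with an unknown barcode, or without the assumed restriction-site
    remnant directly after the barcode, are discarded.
    """
    if not barcodes:
        raise ValueError("at least one barcode is required")
    lengths = {len(bc) for bc in barcodes}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    bc_len = lengths.pop()

    by_sample: Dict[str, List[str]] = {name: [] for name in barcodes.values()}
    for read in reads:
        sample = barcodes.get(read[:bc_len])
        if sample is None:
            continue  # barcode not recognized (exact matching only)
        insert = read[bc_len:]
        if not insert.startswith(site_remnant):
            continue  # read does not start at a restriction site
        by_sample[sample].append(insert)
    return by_sample


# Toy example with made-up 5-bp barcodes.
example_barcodes = {"ACGTA": "HD", "TGCAT": "RBST"}
example_reads = ["ACGTAAATTCGGGG", "TGCATAATTCCCCC",
                 "ACGTAGGGGGGGGG", "NNNNNAATTCAAAA"]
assigned = demultiplex(example_reads, example_barcodes)
```

Production demultiplexers usually also rescue barcodes with a single sequencing error; that step is omitted here to keep the sketch short.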

SNP calling with reference genome

Raw reads were filtered out using the following criteria: (a) reads with >10 % unidentified nucleotides (N), (b) reads with >40 bases having Phred quality ≤7, and (c) putative PCR duplicates generated by PCR amplification in the library construction process (i.e., read 1 and read 2 of two paired-end reads that were completely identical). All clean short reads were aligned to the reference sequences (unpublished data) using SOAPaligner (Release 2.21, http://soap.genomics.org.cn/) [32]. During alignment, for long reads with high error rates at the 3′-ends, the 5′ 32 bp subsequence was used as the seed. The entire lengths of the reads were used. Five mismatches per read were allowed (important arguments: -l 32 -v 5). The resulting SAM files were converted with Samtools [33]. Variations were called using the Unified Genotyper (Version 3.1, Genome Analysis Tool Kit) [34]. Any nucleotide difference between reads and the reference genome was identified as a variant. This criterion generated a large variant output, which was filtered by three parameters generated with the Unified Genotyper: genotype coverage, genotype quality, and SNP quality.
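The three read-level criteria (a) to (c) can be expressed compactly. The following is a minimal sketch rather than the filtering code used in this study. It assumes Phred+33 quality encoding, applies criteria (a) and (b) to each read of a pair and drops the pair if either read fails, and keeps the first copy of each duplicated pair. All function names are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple

# (read 1 sequence, read 1 qualities, read 2 sequence, read 2 qualities)
ReadPair = Tuple[str, str, str, str]


def low_quality_bases(qual: str, max_phred: int = 7, offset: int = 33) -> int:
    """Count bases whose Phred score is <= max_phred (Phred+33 assumed)."""
    return sum(1 for ch in qual if ord(ch) - offset <= max_phred)


def passes_filters(seq: str, qual: str,
                   max_n_frac: float = 0.10, max_low_q: int = 40) -> bool:
    """Criteria (a) and (b): too many N bases or too many low-quality bases."""
    if not seq:
        return False
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    return low_quality_bases(qual) <= max_low_q


def filter_pairs(pairs: Iterable[ReadPair]) -> Iterator[ReadPair]:
    """Yield clean read pairs, dropping later copies of identical pairs."""
    seen: Set[Tuple[str, str]] = set()
    for seq1, qual1, seq2, qual2 in pairs:
        if not (passes_filters(seq1, qual1) and passes_filters(seq2, qual2)):
            continue
        key = (seq1, seq2)  # criterion (c): read 1 and read 2 both identical
        if key in seen:
            continue
        seen.add(key)
        yield (seq1, qual1, seq2, qual2)
```

Holding every sequence pair in memory is only practical for small inputs; at the scale of billions of reads, hashing or sorting-based duplicate detection would be needed instead.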

SNP calling without reference genome (DISR)

Besides the reference-based method, we also called SNPs by DISR. In the absence of a reference genome, we used a multistep process to identify RAD tag loci within populations, assign a consensus sequence to each individual at each RAD tag locus, and align consensus sequences across populations (Fig. 5). A flowchart is also provided for clarity in Additional file 2.

Fig. 5 SNP calling based on DISR. a Nicotiana tabacum L. has 24 nuclear chromosomes, each of which contains multiple EcoRI cut sites (red marks). The genomic DNA is digested, barcoded with a population-specific sequence, and amplified, resulting in multiple sequence reads from each of the RAD tag sites in the genome. Each sequence consists of a population-specific 5-bp barcode (black), the enzyme-recognition sequence (red), and the downstream sequence. b The de novo RAD tag pipeline compares all the sequenced reads and builds clusters of exactly matching tags. c Pairwise comparisons are made between all clusters. d A cluster within the locus that differs by a SNP. e The consensus sequence for that RAD tag site within the population.

Within each individual, identical reads were aligned together into clusters (termed stacks in other studies) (Fig. 5b–d). The pairwise sequence divergence among clusters was used to group them into putative loci (Fig. 5e). A locus was defined as a set of clusters such that for each cluster there is another cluster in the locus that is at most one nucleotide divergent. Clusters containing excessive numbers of sequence reads can occur when multiple, repetitive sites in the genome are all within a single nucleotide of one another. For this analysis, all clusters with a depth of coverage greater than two standard deviations above the mean cluster depth were removed and the remaining clusters were merged into loci. For each nucleotide site in a locus, a likelihood ratio test of the read counts of alternative nucleotides was used to test whether the allele frequency of the most observed nucleotide was significantly larger than a threshold p, following the method of Emerson et al. [43]. After these steps, an in-house Perl script was used to integrate the clusters of the parents and F1 progeny into a catalog and create a set of all possible loci in the mapping cross. Clusters of the BC1 progenies were then matched against the catalog to determine the genotype at each locus in every individual of the cross population.
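The clustering steps described above (exact-match clusters, removal of excessively deep clusters, and grouping of clusters that differ by at most one nucleotide) can be sketched as follows. This is a toy illustration under stated assumptions, not the in-house pipeline: the function names are hypothetical, the population standard deviation is used for the depth cutoff, reads are assumed to have equal length, and the all-against-all comparison is quadratic and therefore only suitable for small examples.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, Iterable, List


def build_stacks(reads: Iterable[str]) -> Dict[str, int]:
    """Cluster exactly identical reads; value is the cluster depth."""
    return dict(Counter(reads))


def drop_deep_stacks(stacks: Dict[str, int]) -> Dict[str, int]:
    """Remove clusters deeper than mean + 2 SD (likely repetitive sites)."""
    depths = list(stacks.values())
    if len(depths) < 2:
        return dict(stacks)
    cutoff = mean(depths) + 2 * pstdev(depths)
    return {seq: depth for seq, depth in stacks.items() if depth <= cutoff}


def hamming(a: str, b: str) -> int:
    """Number of mismatches; unequal lengths are treated as far apart."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def group_into_loci(stacks: Dict[str, int],
                    max_mismatch: int = 1) -> List[List[str]]:
    """Single-linkage grouping: a cluster joins a locus if it is within
    max_mismatch of any cluster already in that locus (union-find)."""
    seqs = sorted(stacks)
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming(seqs[i], seqs[j]) <= max_mismatch:
                parent[find(i)] = find(j)

    loci: Dict[int, List[str]] = {}
    for i, seq in enumerate(seqs):
        loci.setdefault(find(i), []).append(seq)
    return sorted(loci.values())
```

A locus containing two clusters, such as one with sequences differing at a single site, is a candidate heterozygous SNP; the likelihood ratio test that decides the genotype call is not reproduced here.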

Genotyping and linkage mapping

Markers with distorted segregation (p < 0.01) were filtered out by a Chi-square test before genetic map construction (markers with χ² < 15 were selected for JoinMap 4.0) [35]. LGs were identified with an independence logarithm of odds (LOD) threshold of 7. Because of the large number of markers segregating in the population, if the number of markers in a linkage group was more than 220, we used the maximum likelihood mapping algorithm in JoinMap 4.0 to determine the marker order for computational efficiency [44], and calculated genetic distances (cM) using Haldane’s mapping function. However, the lengths of the corresponding linkage groups (3000–6000 cM) exceeded the limits of JoinMap 4.0, and therefore the linkage length was divided by 100 for map presentation. In the other linkage groups, whose marker number was equal to or less than 220, a linear regression algorithm and Kosambi’s mapping function were used for map construction and genetic distance estimation [45]. Following the initial mapping, potential errors that appeared as doubtful double-recombinants were identified using the genotype probabilities function of JoinMap 4.0 [35] (p < 0.001). Each suspicious genotype was replaced by a missing value, as suggested by Isidore et al. [46] and Van Ooijen [35]. A linkage map was then constructed afresh using the corrected dataset. Potential error elimination and linkage map construction were iterated until no dubious genotype was identified. Markers with >45 % missing values or with distorted segregation (χ² test, p < 0.001, d.f. = 2) were removed in each step of the iteration.
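For readers unfamiliar with the two mapping functions named above, the sketch below gives their standard textbook forms, which convert a recombination fraction r (0 ≤ r < 0.5) into a map distance in cM. It is not JoinMap's implementation, and the function names are hypothetical.

```python
import math


def _check_r(r: float) -> None:
    """Recombination fractions are only meaningful in the range [0, 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")


def haldane_cm(r: float) -> float:
    """Haldane's function (no crossover interference): d = -50 ln(1 - 2r)."""
    _check_r(r)
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r: float) -> float:
    """Kosambi's function (with interference): d = 25 ln((1 + 2r)/(1 - 2r))."""
    _check_r(r)
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))
```

For any r > 0, Haldane's function gives a larger distance than Kosambi's because it assumes no interference, so lengths of LGs computed with different functions are not directly comparable.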