Background

A single nucleotide polymorphism (SNP) is an allelic single-base variation between two haplotype sequences in an individual or between any paired homologous chromosomes across homogenous members. SNPs are the most abundant genomic DNA variations and are ubiquitous in both functional genes and non-coding regions [1]. Because they are conserved during evolution, associated with genetic traits, and suited for high-throughput genotyping, SNPs are a popular and powerful tool for various genetics and genomics studies, such as mapping of whole genomes, tagging of important traits, comparison of genome evolution, classification of diverse clades, and many rapidly developing areas such as pharmacogenomics and functional proteomics [2–4]. SNPs from expressed sequence tags (ESTs) represent hundreds of thousands of functional genes and likely control many genetic traits [5–8]. Because most three-nucleotide codons are degenerate, a SNP in a coding region may be synonymous (sSNP) if it does not change the protein sequence, or non-synonymous (nsSNP) if it does. The nsSNPs are usually more biologically relevant because the resulting amino acid changes may alter protein secondary structures and functions and cause phenotypic mutations [1, 8, 9].
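The sSNP/nsSNP distinction can be illustrated with a minimal Python sketch. The function name `classify_coding_snp` and the deliberately partial codon table are assumptions made for illustration; they are not taken from any of the pipelines cited here, and a real analysis would use the full genetic code and the correct reading frame.

```python
# Partial standard genetic code: only the codons needed for this illustration.
CODON_TABLE = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine (fourfold degenerate)
    "GAT": "D", "GAC": "D",                          # aspartate
    "GAA": "E", "GAG": "E",                          # glutamate
}


def classify_coding_snp(codon, pos, alt):
    """Classify a substitution of base `alt` at 0-based position `pos` of `codon`.

    Returns 'sSNP' if the encoded amino acid is unchanged, 'nsSNP' otherwise.
    """
    mutated = codon[:pos] + alt + codon[pos + 1:]
    same_residue = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "sSNP" if same_residue else "nsSNP"


# Third-position change GCT -> GCC keeps alanine.
third_position = classify_coding_snp("GCT", 2, "C")
# Second-position change GCT -> GAT replaces alanine with aspartate.
second_position = classify_coding_snp("GCT", 1, "A")
```

Third-position changes are often silent because of codon degeneracy, although not always: GAT to GAA in the same table changes aspartate to glutamate.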

SNP discovery is usually accomplished through computational alignment of redundant DNA sequences with each other or with a high-quality reference genome, so that discrepant nucleotides can be detected and evaluated. For the redundancy-based computational approach, sequencing errors are one source of false SNPs [5, 7, 10], but it may be even more challenging to distinguish real SNPs among allelic sequences from single nucleotide discrepancies among highly identical paralogous sequences [8, 11]. Several bioinformatics programs (pipelines) have been developed for automatic SNP mining, using different input data, computational algorithms, quality evaluation strategies, and/or output formats. For example, the PolyPhred and PolyBayes pipelines typically require sequence trace files or extracted sequences with base-calling quality values to minimize false SNPs resulting from sequencing errors [12–14]. PolyBayes also includes an extra implementation to identify paralogs and their derived false SNPs [13]. Others, like autoSNP and QualitySNP, can accept sequences without quality files for initial redundancy-based detection and then grade SNPs by confidence levels; they are more commonly used with public ESTs, which usually do not have trace or quality files [8, 15]. The QualitySNP pipeline implements a haplotype reconstruction algorithm and a confidence scoring approach to detect reliable synonymous and non-synonymous SNPs from public ESTs without quality files or a reference genome [8]. In other words, it re-clusters the ESTs in a contig to determine the potential haplotypes in the contig. Only single discrepant nucleotides between any two reconstructed haplotypes are scored as potential SNPs. Because sequence differences can also result from sequencing errors or alignment of paralogs, only those potential SNPs passing additional confidence interrogation are identified as quality SNPs. Reliable quality SNPs represent the different alleles (haplotypes) of a gene. As opposed to low-confidence and false SNPs, quality SNPs can benefit allele-trait association studies [8].

Most citrus species are diploid (2n = 2× = 18), with highly heterozygous and relatively small genomes and over 30,000 predicted genes [16]. In general, citrus refers to true biological species and ancestrally domesticated introgressions in Citrus and those in the sexually compatible Fortunella (kumquat) and Poncirus (trifoliate orange) genera. Citrus fruit types are diverse and include sweet orange (Citrus sinensis), mandarin (C. reticulata), grapefruit (C. paradisi), lemon (C. limon), lime (C. aurantifolia), pummelo (C. maxima), and citron (C. medica). Each type consists of many cultivars primarily selected from spontaneous bud sports, chance seedlings, induced mutants, or conventional hybrids. It is widely believed that only C. maxima, C. reticulata, and C. medica are true species, although the binomial names for the other ancestral hybrid and introgression cultivars are widely accepted and used [17, 18]. These citrus types likely vary in levels of heterozygosity and share alleles resulting from early introgressions across these genomes, according to SSR markers [19–21]. A haploid Clementine genome sequence was produced using Sanger technology, and one diploid sweet orange genome using Roche 454 technology [22], along with many other citrus genomes using other re-sequencing platforms (Gmitter et al. unpublished data). Together with other available citrus genomic resources, these make it possible to detect and compare SNPs in large-volume citrus Sanger EST datasets within and among different citrus cultivars. These gene-based SNPs, once available to the citrus community, will be very valuable in many genetic and genomic studies, and helpful for trait-targeted breeding as well [20, 21, 23].

In this paper, SNPs in public ESTs from 27 different citrus genotypes were detected by the QualitySNP pipeline and compared to estimate the heterozygosity of each genome. All of the short SNP oligo sequences were also aligned with the Clementine citrus genome to determine their distribution and uniqueness in the genome and for in silico validation. Selected SNPs were also validated by SNaPshot and sequencing.

Methods

Citrus ESTs and cultivars

All citrus ESTs were retrieved from the National Center for Biotechnology Information (NCBI) EST database or ftp repository if available. There were 27 citrus cultivars or biotypes with ESTs (Table 1, Additional file 1). In addition to the binomial and common names, abbreviations for the 27 cultivars were designated to facilitate presentation (Table 1, Additional file 1); the binomial names are those used for the accessions in the NCBI database. ESTs were searched for SNPs using the QualitySNP pipeline [8] in each of the 27 cultivars and in three cultivar groups: 12 mandarins (M12), 7 limes/lemons/citron (L7), and all 27 cultivars (C27). The mining results for the individual cultivars in the three groups were summed, giving SM12, SL7, and SC27, which were compared with those of M12, L7, and C27, respectively (Additional file 1). 'Ridge Pineapple' sweet orange (Citrus sinensis) was selected for SNP validation because most ESTs and SNPs are from sweet orange and it is a parent of several widely used mapping populations.

Table 1 Public ESTs in citrus cultivars/biotypes

SNP discovery and primer design

The QualitySNP pipeline was installed and used for SNP discovery, following the program manual and recommended parameters [8]. QualitySNP first identified haplotypes in a contig by re-clustering its ESTs and extracted all nucleotide discrepancies (called potential SNPs, pSNPs) between identified haplotypes in the contig, from which a subset of so-called quality SNPs (qSNPs) was identified based on the allele and SNP confidence scores defined in the haplotype-based mining algorithm [8]. The qSNP-containing contigs and 25-mer oligo sequences, along with much other mining information, were saved in separate files for database construction and result summary. The ratios of qSNP/pSNP were calculated to indicate the percentage of nucleotide discrepancies (pSNPs) identified as high-quality SNPs (qSNPs) by the QualitySNP algorithm. Bioinformatics programs included in the pipeline were cross_match in the phred-phrap-consed package [24, 25] to remove vectors, CAP3 [26] to assemble ESTs, and FASTY [27] to align ESTs to the proteins in the Uniprot database for identification of non-synonymous and synonymous SNPs. BatchPrimer3 [28] was used to design a forward (F), a reverse (R), and a single base extension (SBE) primer flanking each SNP site. The F, R, and SBE primers of 96 SNPs from SO were selected for both sequencing and SBE genotyping validation (Additional file 2). After sorting by SBE primer length, the last 7 of every 8 SBE primers (all but the first) were tailed at the 5' end with three groups of non-homologous polynucleotides of different lengths to facilitate future multiplex genotyping applications. All the F, R, and tailed SBE primers, 96 each, were synthesized by Eurofins MWG Operon (Huntsville, AL) in three separate 96-well plates, with the three primers of each SNP placed in the same well position of the three plates, and stored in ddH2O at 10 μM. This format facilitated easy primer positioning and multichannel pipetting during the genotyping and sequencing preparation.

SNP 25-nucleotide sequence blast

All 25-nucleotide oligo sequences (with the SNP as the middle nucleotide) generated from every citrus genotype by QualitySNP were combined and aligned to the haploid Clementine reference genome (version 1.0; phytozome.org and citrusgenomedb.org) using BLASTN [29] with a cut-off e-value of 6e-004 (0.0006). Each query sequence (25-mer oligo) searched against the subject scaffolds yielded one of the following BLASTN outputs: "no hits found", 1 hit on 1 scaffold with 1 alignment, or any other case (i.e., 1 hit on 1 scaffold with 2+ alignments at different positions, or 2+ hits on different scaffolds with 1+ alignments per hit). At the preset e-value, only alignments with identities of 84% and higher (in other words, only 6 types of alignment hits: 25/25, 24/25, 24/24, 23/23, 22/22, and 21/21) were saved in the BLASTN output file. The information in the output file, including the scaffold, position, strand, e-value, score, and alignment identities of each hit, and the hit status, was parsed into an EXCEL file to summarize SNP alignment status and to calculate the distribution on the Clementine reference genome scaffolds. The information was also used as an additional criterion for categorizing SNPs and selecting desired core sets.
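The three hit categories described above can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name `hit_status`, the category labels, and the input layout (a list of `(scaffold, start)` tuples already parsed from the BLASTN output for one query) are hypothetical and do not reproduce the parsing actually used in the study.

```python
def hit_status(alignments):
    """Categorize one 25-mer query by its parsed BLASTN alignments.

    `alignments` is a list of (scaffold_name, start_position) tuples;
    an empty list corresponds to "no hits found".
    """
    if not alignments:
        return "no hit"
    scaffolds = {scaffold for scaffold, _start in alignments}
    if len(scaffolds) > 1:
        return "multiple scaffolds"
    if len(alignments) > 1:
        return "one scaffold, multiple alignments"
    return "unique"


# Hypothetical parsed results for three queries.
examples = {
    "snp_a": [("scaffold_1", 120_345)],
    "snp_b": [("scaffold_3", 5_000), ("scaffold_3", 910_222)],
    "snp_c": [("scaffold_1", 42), ("scaffold_7", 77_000)],
}
statuses = {name: hit_status(hits) for name, hits in examples.items()}
```

Only queries in the "unique" category would count as single-copy 25-mers in the reference genome under this scheme.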

SNP validation by sequencing and SNaPshot genotyping assay

The BigDye Terminator V3.1 Cycle Sequencing Kit and the SNaPshot Multiplex Kit (Applied Biosystems, Foster City, CA) were used to validate SNPs, following the manufacturer's protocols with some modifications in reaction volumes and/or quantity of proprietary reagents. 96-well plates were used for PCR, enzymatic incubation, and denaturation on an iCycler (Bio-Rad, Hercules, CA) and/or a GeneAmp PCR System 9700 (Applied Biosystems, Foster City, CA), and for genotyping and sequencing on a 3130xl Genetic Analyzer (Applied Biosystems, Foster City, CA). Unless otherwise stated, plates were centrifuged briefly at up to 1000 rpm in a Jouan MR 23i centrifuge after addition of a solution or before a new step, and all PCR and enzymatic incubation programs were set to hold at 4°C indefinitely at the end until the next procedure.

For both dye terminator sequencing and SNaPshot assays to validate SNPs, template preparation was carried out in 10 μl in each well, consisting of 3.3 μl ddH2O, 1.0 μl 10x dNTPs (2 mM), 2.0 μl 5x colorless GoTaq Flexi buffer, 0.8 μl 25 mM MgCl2, 0.4 μl each of F and R primers, 0.1 μl GoTaq Flexi (5 units/μl; Promega, Madison, WI), and 2 μl genomic DNA (10 ng/μl). The touch-down PCR program started with an initial denaturation at 94°C for 3 min, followed by 10 cycles of 93°C for 30 sec, 56°C for 45 sec (decreasing 0.5°C at each annealing step), and 72°C for 45 sec, then 30 further cycles with annealing at 51°C, plus a final elongation at 72°C for 15 min. Primers and unused dNTPs were removed by adding 1 μl of ExoSAP-IT (Affymetrix, Santa Clara, CA) to each well of the plate and incubating at 37°C for 60 min and 75°C for 15 min.
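The reaction mix and touch-down program can be checked with a short Python sketch. It assumes that the first touch-down cycle anneals at 56°C and that each subsequent cycle is 0.5°C lower, which is one reading of the description; the names `template_mix_ul` and `touchdown_annealing` are illustrative only.

```python
# Per-well template PCR mix in microlitres, as listed in the text.
template_mix_ul = {
    "ddH2O": 3.3,
    "10x dNTPs": 1.0,
    "5x GoTaq Flexi buffer": 2.0,
    "25 mM MgCl2": 0.8,
    "F primer": 0.4,
    "R primer": 0.4,
    "GoTaq Flexi": 0.1,
    "genomic DNA": 2.0,
}
# Rounded to absorb floating-point noise in the sum.
total_volume_ul = round(sum(template_mix_ul.values()), 2)


def touchdown_annealing(start_c=56.0, step_c=0.5, touchdown_cycles=10, hold_cycles=30):
    """Return the annealing temperature of every cycle, in order.

    Assumption: cycle 1 anneals at `start_c`; each later touch-down cycle
    is `step_c` lower; the remaining cycles hold at the next step down.
    """
    ramp = [start_c - step_c * i for i in range(touchdown_cycles)]
    hold_temp = start_c - step_c * touchdown_cycles
    return ramp + [hold_temp] * hold_cycles


schedule = touchdown_annealing()
```

Under this reading the ten touch-down cycles run from 56.0°C down to 51.5°C, and the following 30 cycles anneal at 51.0°C, matching the 51°C stated for the continuing cycles.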

Sequencing reactions for SNP validation were prepared in 10 μl in each well of a new plate, including 2 μl 5x sequencing buffer, 2 μl ready reaction premix from the sequencing kit, 1 μl 10 μM SNP F primer, and 5 μl ExoSAP-IT treated PCR product. The program started at 95°C for 1 min, followed by 25 thermal cycles of 95°C for 10 sec, 50°C for 5 sec, and 60°C for 4 min. Following the manufacturer's instructions, ethanol/EDTA/sodium acetate precipitation was used to purify the sequencing product in the plate, which was subsequently air dried, mixed with 2 μl ddH2O and 6 μl Hi-Di formamide in each well, denatured, and loaded onto the genetic analyzer for sequencing. The sequence files were analyzed by Sequencing Analysis software (Applied Biosystems, Foster City, CA) to generate sequences and electropherograms. A SNP was considered validated when the SBE primer sequence aligned correctly to the corresponding sequence and two different overlapping nucleotide peaks were visible at the SNP site in the electropherogram.

The SBE reaction for SNaPshot assays was prepared in 5 μl in each well of a new plate, including 0.5 μl ready reaction premix from the SNaPshot kit, 1 μl 10 μM SBE primer, and 3.5 μl ExoSAP-IT treated PCR product, and run for 25 thermal cycles of 95°C for 10 sec, 50°C for 5 sec, and 60°C for 30 sec. Unincorporated dye-labeled ddNTPs were removed by adding 5 μl SAP mix (3.5 μl ddH2O, 1.0 μl 10x SAP buffer, and 0.5 μl 1 U/μl SAP) to the SBE reaction mix and incubating at 37°C for 60 min and 75°C for 15 min. Genotyping was performed using 8 μl of mix in each well of a new plate, consisting of 1 μl SAP treated SBE product, 0.25 μl GeneScan 120 LIZ size standard, and 6.75 μl Hi-Di formamide, which was denatured at 95°C for 3 min and then immediately placed on ice for at least 2 min. The SNaPshot files were used to score SNPs in GeneMarker (SoftGenetics, State College, PA), in which a validated SNP showed two different nucleotides.

Results

Haplotype-based EST-SNPs in citrus cultivars

Haplotype-based SNPs were mined from ESTs of the 27 citrus cultivars and 3 groups (M12 – 12 mandarins, L7 – 7 limes/lemons, and C27 – all 27 combined) using the QualitySNP pipeline and summarized in detail (Additional file 1). In summary (SC27 – the last column in Additional file 1), a total of 25,417 qSNPs (Additional file 2) were identified from ESTs of the 27 cultivars mined separately. These are attributed to heterozygosity within cultivars at SNP loci. Only 2,805 SNPs were duplicated, according to a comparison of all the 25-mer oligo sequences. The percentages of the 7 SNP types were similar among most citrus cultivars in which every type of quality SNP was found. Among the 25,417 qSNPs summed from the 27 citrus cultivars, 15,010 (59.1%) were transitions (AG and CT), 9,114 (35.9%) transversions (AC, GT, CG, and AT), and 1,293 (5.0%) insertion/deletion events (indels). On average, there were 2.4 SNPs per contig and one SNP every 1,064 bp in all of the SNP-containing contig sequences (Figure 1; Additional file 1).
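The grouping of the seven SNP types into transitions, transversions, and indels follows from purine/pyrimidine chemistry, and can be sketched in Python as below. The function name `snp_class` and the use of "-" to mark a gap are assumptions made for this illustration; the counts are those reported above for the 27 cultivars.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def snp_class(allele_a, allele_b):
    """Classify a biallelic site; '-' denotes a gap (insertion/deletion)."""
    if "-" in (allele_a, allele_b):
        return "indel"
    pair = {allele_a, allele_b}
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


# Counts reported for the 27 cultivars mined separately (SC27).
counts = {"transition": 15_010, "transversion": 9_114, "indel": 1_293}
total_qsnps = sum(counts.values())
ts_tv_ratio = round(counts["transition"] / counts["transversion"], 2)
```

With these counts the three classes add up to the 25,417 qSNPs, and transitions outnumber transversions by roughly 1.65 to 1.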

Figure 1

Percentages of the 7 SNP types, AG, CT, AC, GT, CG, AT, and indel, discovered from citrus ESTs. Presented here are 9 selected citrus cultivars, 3 groups, and 3 sums. SO, Sweet orange; CM, Clementine mandarin; PM, Ponkan mandarin; SM, Satsuma mandarin; ML, Rangpur lime; BO, Sour orange; GF, Grapefruit; NK, Nagami kumquat; TO, Trifoliate orange; M12, SNPs from ESTs combined from 12 mandarins (2–13 in Table 1); L7, SNPs from ESTs combined from 7 limes/lemons (14–20 in Table 1); C27, SNPs from all ESTs combined (1–27 in Table 1); SM12, SL7 and SC27, the respective sums of the 12 mandarins, 7 limes/lemons, and all 27 cultivars. On average across the 27 cultivars (SC27), transitions (AG and CT) account for 59.1%, transversions (AC, GT, CG, and AT) for 35.9%, and insertion/deletions (indels) for 5.0%.

The numbers of ESTs differed among individual cultivars, and consequently so did their quality SNPs and other related numbers. For example, in SO, 213,830 ESTs yielded 7,404 contigs of >=4 ESTs. Of these, 4,228 contigs contained 43,655 potential SNPs and 3,327 contained qSNPs. The total number of qSNPs was 11,182. In other words, only one haplotype was detected in 3,176 contigs (7,404 minus 4,228), and no quality SNP was identified in a further 901 contigs (4,228 minus 3,327) with potential SNPs. There were 3.4 quality SNPs per contig and one quality SNP per 723 bp in the contigs on average. Of these 11,182 qSNPs, 6,822 (61.0%) were transitions (AG and CT types), 3,879 (34.7%) transversions (AC, GT, CG, and AT types), and 481 (4.3%) insertions/deletions (indels); 2,619 (23.4%) were nsSNPs and 4,038 (36.1%) were sSNPs. The absolute numbers of quality SNPs were not comparable because of the varying numbers of ESTs among citrus cultivars, but the numbers of potential and quality SNPs from each cultivar were strongly correlated with its number of ESTs; more ESTs yielded more usable contigs (>=4 ESTs) available for SNP mining, as well as more quality SNPs (Additional file 1). Given the large differences in the numbers of ESTs available among the various cultivars, it is more informative to compare SNP frequencies, rates, and ratios among cultivars with substantial EST numbers and distinct genetic backgrounds, and to compare the mining results of the three grouped EST sets (M12, L7, and C27) with the three sums/averages (SM12, SL7, and SC27) of their separately mined individual counterparts. These comparisons are elaborated below.
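Several of the sweet orange (SO) figures above are simple derived quantities, and a short Python sketch makes the derivations explicit. The dictionary keys and variable names are illustrative; the input numbers are those reported in the text.

```python
# Sweet orange (SO) mining results as reported in the text.
so = {
    "contigs_used": 7_404,         # contigs with >=4 ESTs
    "contigs_with_psnps": 4_228,
    "contigs_with_qsnps": 3_327,
    "psnps": 43_655,
    "qsnps": 11_182,
}
so_qsnp_types = {
    "transition": 6_822,
    "transversion": 3_879,
    "indel": 481,
}

# Contigs in which only one haplotype was detected (no potential SNPs).
single_haplotype_contigs = so["contigs_used"] - so["contigs_with_psnps"]

# Average number of quality SNPs per qSNP-containing contig.
qsnps_per_contig = round(so["qsnps"] / so["contigs_with_qsnps"], 1)

# Share of potential SNPs that passed the confidence filters.
qsnp_to_psnp_pct = round(100 * so["qsnps"] / so["psnps"], 1)

# Percentage of each SNP class among the SO quality SNPs.
type_pct = {
    name: round(100 * n / so["qsnps"], 1) for name, n in so_qsnp_types.items()
}
```

The qSNP/pSNP percentage computed here is not quoted in the text for SO but follows directly from the two reported counts; about a quarter of the potential SNPs were retained as quality SNPs.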

Haplotypes detected in contigs with SNPs

One important feature of QualitySNP is that it re-clusters the ESTs in a contig to reconstruct and determine the haplotypes in that contig, from which only single nucleotide discrepancies between any two defined haplotypes (allelic sequences) are considered potential SNPs for further quality and confidence interrogation. Only those potential SNPs passing the confidence scores are identified as quality SNPs. Additional file 1 includes all the haplotypes detected in the SNP-containing contigs from all 27 citrus cultivars. Theoretically, a maximum of 2 haplotypes should be detected in a diploid genome. As expected, the vast majority of SNP-containing contigs consisted of two haplotypes, but the percentage of 2-haplotype contigs varied widely among these citrus cultivars (Figure 2, Additional file 1). Among the highest were ML (92%), SC (84%), and GF (76%), and among the lowest PM (38%), KL (42%), and CM (48%). The variation likely results from the genetic makeup of the "cultivar" used to generate the ESTs. For example, ESTs for SO came from navel oranges, blood oranges, and others named C. sinensis, rather than a single genotype. In contrast, other "cultivars" are likely single clones. As expected, much lower percentages of 2-haplotype contigs were found in the three combined EST datasets (M12, 44%; L7, 70%; and C27, 34%) than in the counterpart averages of each group (SM12, 48%; SL7, 74%; and SC27, 53%), because combining ESTs introduces more haplotypes from different types of citrus cultivars. As a consequence, more qSNPs, and higher qSNPs/pSNPs and qSNPs/ESTs ratios, were found in the three grouped EST datasets (M12, L7, and C27) than in their counterparts (SM12, SL7, and SC27) summed from the individually mined cultivar EST results, but the ratio of contigs with qSNPs to contigs used showed the opposite trend (Figure 3, Additional file 1). The frequency of qSNPs is much higher in the pooled data for the three groups (M12, L7 and C27) than in the summed data for individual cultivars. This is because the group values include polymorphism among homozygous accessions as well as heterozygosity within cultivars, while the summed data include only SNPs due to heterozygosity. In other words, a SNP detected only in the pooled data is very likely homozygous within each genotype, making it useless for genetic linkage mapping of that genotype.

Figure 2

Percentages of detected haplotype numbers (2, 3, 4, and >=5) in contigs (>=4 ESTs) with potential SNPs. Presented here are 9 selected citrus cultivars, 3 groups, and 3 sums. SO, Sweet orange; CM, Clementine mandarin; PM, Ponkan mandarin; SM, Satsuma mandarin; ML, Rangpur lime; BO, Sour orange; GF, Grapefruit; NK, Nagami kumquat; TO, Trifoliate orange; M12, SNPs from ESTs combined from 12 mandarins (2–13 in Table 1); L7, SNPs from ESTs combined from 7 limes/lemons (14–20 in Table 1); C27, SNPs from all ESTs combined (1–27 in Table 1); SM12, SL7 and SC27, the respective sums of the 12 mandarins, 7 limes/lemons, and all 27 cultivars.

Figure 3

Comparisons between M12 vs. SM12, L7 vs. SL7, and C27 vs. SC27, respectively, in three ratios. The three ratios are presented as percentages. qSNPs, the number of quality SNPs; pSNPs, the number of potential SNPs; ESTs, the number of ESTs; contigs qSNPs, the number of contigs with qSNPs; contigs used, the number of contigs with >=4 ESTs. M12, L7 and C27 are mined from grouped ESTs from the corresponding cultivars, and SM12, SL7, and SC27 are summed from the individually mined cultivars used in the grouped counterparts, respectively.

Alignment and distribution on the Clementine reference genome

A total of 25,417 25-mer sequences (query sequences, Additional file 2) with quality SNPs from all 27 citrus cultivars were aligned to the Clementine reference scaffolds (subject sequences) using BLASTN at a cut-off e-value of 6e-004 (Table 2). Of these, 2,947 sequences had "no hits found" and 22,470 had one or more hits. Of the 22,470 SNPs with hits, 19,943 had only 1 scaffold hit with only 1 alignment on the scaffold, 1,571 had 1 scaffold hit but >=2 alignments on the scaffold (3 alignments per scaffold hit on average), and 956 had >=2 scaffold hits (~3 hits per oligo on average) with 1 or more alignments on each of the scaffolds (~7 alignments per scaffold hit or ~20 alignments per oligo on average). This suggests that the 19,943 25-mer oligo sequences are unique in the genome, and that the remaining 2,527 25-mer sequences may have duplicated or similar sequences with at least 84% identity at different locations in the genome. In one extreme case, a 25-mer sequence from trifoliate orange yielded 29 scaffold hits and 2,162 alignments across all the scaffolds, the highest numbers of all.

Table 2 BLASTN results of 25,417 25-mer oligo sequences

Taking these multiple scaffold hits and alignments into account, the total number of scaffold hits was 24,293, with a total of 43,668 alignments on the scaffolds. Most had 100% (25/25) or 96% (24/25) nucleotide identity to the reference genome, accounting for 93% of all the alignments. Almost all the nucleotide discrepancies in the 24/25 alignments were at the SNP sites, which is an encouraging in silico validation of these SNPs. Of the total 24,293 scaffold hits, 23,955 were on main scaffolds 1 to 9 (2,122, 2,804, 4,159, 2,813, 3,045, 2,501, 1,861, 2,308, and 2,342, respectively), accounting for 98.6% of the total. The remaining 338 were on 87 small scaffolds. Figure 4 shows the distribution of SNPs with all and unique hits from SO, TO, and CM on scaffold_1 of the haploid Clementine genome (similar figures for scaffold_2 are in Additional file 3). According to the aligned SNP counts in each 500 kb, there were some distinctive regions (intervals in Figure 4). For example, in SO many fewer unique hits were found in the middle region than in the two arm regions. A relatively even distribution was observed in CM, with an exception at Interval 5, which had overwhelmingly duplicated hits of certain SNPs (similar to the same region in SO). Very few unique SNPs aligned at Intervals 20–27 in all three cultivars, suggesting that the region may contain the centromere, which is usually characterized by fewer genes. These results, combined with other criteria, should greatly facilitate selection of well-distributed core sets of SNPs across citrus genomes for different genotyping applications and genetic studies.
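Counting aligned SNPs per 500 kb interval is a simple binning step, sketched below in Python. The function name `bin_counts`, the 1-based interval numbering, and the example positions are assumptions made for illustration; they are not the study's data or its actual spreadsheet procedure.

```python
from collections import Counter

BIN_SIZE = 500_000  # 500 kb intervals, as in the distribution analysis


def bin_counts(positions, bin_size=BIN_SIZE):
    """Count alignment start positions per interval on one scaffold.

    Positions are assumed to be 1-based, so positions 1..500,000 fall in
    interval 1, positions 500,001..1,000,000 in interval 2, and so on.
    """
    return Counter((pos - 1) // bin_size + 1 for pos in positions)


# Hypothetical alignment positions on a single scaffold.
example_positions = [1, 499_999, 500_000, 500_001, 1_200_000]
per_interval = bin_counts(example_positions)

# Scaffold-hit totals reported for main scaffolds 1 to 9.
main_scaffold_hits = [2_122, 2_804, 4_159, 2_813, 3_045, 2_501, 1_861, 2_308, 2_342]
main_total = sum(main_scaffold_hits)
main_share_pct = round(100 * main_total / 24_293, 1)
```

Running the same binning once on all alignments and once on unique-hit SNPs only would give the two series compared per interval.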

Figure 4

SNP distribution on the Clementine reference genome, using Scaffold_1 as an example. Each interval of the x-axis represents 500 kb of the scaffold, and the y-axis represents the number of SNPs in each 500 kb on the scaffold. SO – sweet orange (A); TO – trifoliate orange (B); CM – Clementine mandarin (C); "_a" – counts of all alignments generated by all SNPs; "_1" – counts of SNPs with only 1 unique hit/alignment in the genome. Differences between the "_a" and "_1" numbers are observed in several regions of each cultivar.

SNP validation by sequencing and SNaPshot genotyping assay

Of the 96 randomly selected sweet orange SNPs, 68 were validated by sequencing and 74 by SNaPshot in sweet orange (Additional file 4). Of these, 61 were validated by both assays and the remainder by only one assay: 7 were validated by sequencing but failed in SNaPshot, and 13 were validated by SNaPshot but failed in sequencing. Therefore, a total of 81 SNPs (84%) were validated by at least one of the two assays. The high rate (84%) of validated SNPs was consistent with the 93% of alignments onto the reference genome with 100% (25/25) or 96% (24/25) identities (Table 2), indicating that QualitySNP, a haplotype-based SNP mining algorithm and pipeline, is a very reliable tool for identifying true EST SNPs, and that it can effectively minimize the false discovery rate even without quality files.
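The validation totals follow from inclusion-exclusion over the two assays, as the short Python check below shows. Variable names are illustrative; the four input counts are those reported above.

```python
tested = 96             # sweet orange SNPs assayed
by_sequencing = 68      # validated by dye terminator sequencing
by_snapshot = 74        # validated by SNaPshot
by_both = 61            # validated by both assays

sequencing_only = by_sequencing - by_both
snapshot_only = by_snapshot - by_both

# Inclusion-exclusion: SNPs validated by at least one assay.
by_either = by_sequencing + by_snapshot - by_both
validation_rate_pct = round(100 * by_either / tested)
not_validated = tested - by_either
```

The remaining SNPs that were validated by neither assay are the ones whose true or false status could not be determined.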

Discussion

Estimation of heterozygosity of different citrus genomes by haplotype-basedSNPs

Many naturally evolved genomes are heterozygous, and the heterozygosity level may be evaluated by the rate of allelic nucleotide variations between the two haplotypes [30]. SNPs, the most abundant polymorphisms in genomes, are likely the most appropriate index for the heterozygosity levels of genetically/taxonomically related genomes [19, 21, 22]. Given the different numbers and rates of haplotype-based SNPs discovered from the citrus individuals with substantial numbers of ESTs (for example, more than 5,000; Additional file 1), the qSNPs/ESTs ratios in most of them appeared to reflect their heterozygous status and genetic background. Hybrid derivatives had much higher qSNPs/ESTs ratios, while the presumed "pure" species had lower ratios. For example, some proven natural hybrid cultivars, such as SO and CM, and recent hybrids such as SC, had among the higher qSNPs/ESTs ratios (SO - 5.23%, CM - 8.31%, and SC - 7.76%). Other presumed true species, including PM, had lower qSNPs/ESTs ratios (PM - 0.60%). From these ratios, the number of ESTs needed to generate a desired number of SNPs in a given citrus genotype, and vice versa, can be estimated. This tendency, along with the relationship between the ratios and genome heterozygosity, would be more conclusive if the numbers of ESTs in all the cultivars were close to each other, or at least within a much smaller range.
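The qSNPs/ESTs ratio and the EST estimate mentioned above can be sketched in Python as follows. The helper name `ests_needed` is hypothetical, and the estimate assumes that qSNP yield scales linearly with EST number, which is a simplification of the correlation reported in the Results rather than a fitted model.

```python
import math

# Sweet orange (SO) counts reported in the Results.
so_ests = 213_830
so_qsnps = 11_182

qsnp_per_est_pct = round(100 * so_qsnps / so_ests, 2)


def ests_needed(target_qsnps, qsnps, ests):
    """Estimate the ESTs required for `target_qsnps`, assuming a linear yield.

    Assumption: the observed qSNPs/ESTs ratio holds at other sampling depths.
    """
    ratio = qsnps / ests
    return math.ceil(target_qsnps / ratio)


# ESTs that would be needed for a 384-SNP core set at the SO ratio.
ests_for_384 = ests_needed(384, so_qsnps, so_ests)
```

Because SNP discovery depends on contig depth as well as EST number, a linear estimate of this kind is best treated as a rough planning figure.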

SNP discovery and validation rates

SNP mining is no longer a bottleneck because computational capacity and sequence data are increasing exponentially, and more SNP mining pipelines have become available in recent years [7, 8, 12–15, 31]. Hundreds of thousands of SNPs can easily be mined from EST or genomic sequences. Including false SNPs in genotyping is wasteful; therefore, maximizing the true SNP rate (minimizing the false rate) is the most important requirement for a SNP mining algorithm, because any validation approach can only validate true SNPs, not false ones [8, 13]. We found that 93% of the alignments of SNPs identified by the QualitySNP pipeline onto the reference genome had 25/25 or 24/25 identities, and that 84% (81 of 96) of randomly selected sweet orange SNPs were validated by sequencing and/or SNaPshot genotyping. It was undetermined whether the others, those not aligned at the two identity rates or not validated by sequencing and/or genotyping, were true or false SNPs. For example, failures in sequencing validation might be due to the SBE primer sequence not being found (likely because of an intron in the region), sequencing failures caused by primers of low quality or located in a variable region, or an absence of nucleotide discrepancies at the sites. It was unclear why some SNPs failed in SNaPshot validation; it is speculated that some of the SBE primers might have been incorrectly positioned, i.e., the singly extended nucleotides may not have been exactly at the SNP sites. A few such cases were identified (Chen et al. unpublished data), very likely due to differences between the consensus contigs and the original haplotype sequences. On the other hand, only 2 haplotypes may exist in a diploid genome. If SNPs came from contigs with more than 2 haplotypes, such cases could result from either ESTs mixed from diverse genotypes in the same species or highly identical paralogs assembled into the contigs. Paralogous genes, which result from genomic duplication and evolve different functions, are very common in many genomes and remain almost identical in their conserved regions. ESTs from different paralogous genes, if assembled into the same unigene, could yield false SNPs that are non-allelic and useless.

Criteria for selection of citrus core SNP sets

In most cases the number of discovered SNPs is so large that only a small portion of them, designated a core SNP set, is selected and used in genotyping to meet constraints on available budget, desired platform, applications, and other factors [3, 11, 32–34]. Core sets of different sizes (e.g. 384, 1536, or other numbers) are either required by certain SNP genotyping platforms or optimized for particular applications [35–38]. It may be a daunting job, but it is necessary to establish workable criteria for selecting a core set of any given size. Based on this complete mining and validation process, several attributes of SNPs can be very useful and distinguishing for refining these core sets. SNP oligo alignment uniqueness, identity percentage, and distribution in the reference genome, and co-existence across different genomes, along with SNP types (nsSNP vs. sSNP, and transition vs. transversion vs. indel) and numbers per gene, should be the main criteria for selection of citrus core SNP sets. As pointed out above, some extra haplotypes might result from paralogs in different genome regions. In that case, the resulting SNPs would not be allelic or useful. Whether they were mostly the SNPs that had multiple scaffold hits and alignments remains unclear pending further investigation. SNPs arising from either circumstance should be excluded or at least deprioritized for use in genotyping. Selecting SNPs for genotyping can be difficult when different attributes of SNPs and genotyping platforms are considered. A tool based on these attributes is being developed to automate the selection of core SNP sets for targeted applications/platforms [35, 36] and to allow geneticists and molecular breeders to select and use core SNPs of interest from among the thousands discovered [37, 38]. All the SNPs (Additional file 2) identified in this work are being added to a citrus genome database (citrusgenomedb.org). Very recently, after this study, another draft genome of sweet orange was reported, yielding 1.06 million genome-wide SNPs, about 3.6 SNPs/kb, which could be an additional valuable resource for SNP applications [39].

Conclusions

High-quality SNPs in public ESTs from different citrus genotypes were detected by the QualitySNP pipeline and compared to estimate the heterozygosity of each genome. All the short SNP oligo sequences were also aligned with the Clementine citrus genome to determine their distribution and uniqueness in the genome and for in silico validation. Selected SNPs were also validated by SNaPshot and sequencing.