Introduction

Microsatellites, also known as simple sequence repeats (SSRs), have been widely utilized in molecular genetic studies for mapping, fingerprinting, genetic diversity, and phylogenetic reconstruction. These markers are characterized by a 1- to 6-bp core repeat that is tandemly repeated in the genome and are generally thought to arise by DNA slippage during DNA synthesis (Schlötterer 1998). These tandem repeats are ubiquitous and can be found in nuclear, chloroplast, and mitochondrial genomes. One of the main properties of microsatellite markers that makes them so widely used in genetic research is that the polymorphism level can be highly discriminating, sufficient to display a unique genotype for each individual in a population from relatively few markers (Estoup et al. 2002). This extreme polymorphism is a consequence of the high mutation rates of these sequences, which allow variability in species otherwise characterized by low levels of genetic diversity (Peakall et al. 1998).
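To make the definition of a microsatellite concrete, the following minimal sketch scans a sequence for tandem repeats of a 1- to 6-bp unit. It is an illustration only: the function name find_ssrs, the threshold of three copies, and the toy sequence are assumptions made here and are not part of the study.

```python
import re

def find_ssrs(sequence, min_repeats=3, max_unit=6):
    """Return (start, unit, copies) for each tandem repeat of a 1..max_unit bp unit."""
    # The unit is matched lazily so the shortest repeating unit is reported;
    # the backreference \1 then requires at least min_repeats - 1 further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

# Invented toy sequence: a (TCC)4 repeat and a (GT)5 repeat in arbitrary flanks.
toy = "AAG" + "TCC" * 4 + "GATTACA" + "GT" * 5 + "CA"
print(find_ssrs(toy))  # [(3, 'TCC', 4), (22, 'GT', 5)]
```

A scan of this kind reports only perfect repeats; imperfect or compound motifs such as those described later for CCT01 and GT03 would be returned as separate shorter hits or missed.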

The precise manner in which microsatellite loci mutate is not clear and can differ by species or loci evaluated. Two molecular mechanisms are thought to play a role in the rapid formation of new alleles at microsatellite loci: unequal exchange in meiosis and slipped-strand mispairing in replication (Levinson and Gutman 1987; Valdes et al. 1993; Orti et al. 1997; Zhu et al. 2000). These mutational mechanisms can generate allelic homoplasy whereby alleles are identical in state (or length), but not identical by descent, and thus contain different sequence motifs. The amount of allelic homoplasy in microsatellite loci, however, seems likely to depend on various factors such as time since divergence and mutation rate. Homoplasy in microsatellite alleles causes apparent similarity, but in reality, masks true evolutionary differences (Angers et al. 2000) among alleles. This phenomenon has often been characterized as the “noise,” whereas homology can be characterized as the evolutionary “signal” (Estoup and Cornuet 1999). Apparent confusion between homology and homoplasy, which can easily occur in microsatellite studies based solely on allele size data, can potentially lead to inaccurate measures of genetic diversity, population divergence, relatedness, phylogenetic reconstruction, and inaccurate interpretation of population structure (Viard et al. 1998; Taylor et al. 1999). However, the exact effects that a given amount of homoplasy will have on the parameters used to describe population structure are not very clear and are difficult to predict (Rousset 1996; Orti et al. 1997).

Calculating genetic distance between individuals using microsatellite data depends on evolution models that aim to replicate the complex mutational process occurring at microsatellite loci (Buschiazzo and Gemmell 2006). Two molecular evolution models commonly used for the analysis of SSR markers are the stepwise mutation model (SMM) and the infinite allele model (IAM). The IAM postulates that each new mutation produces a unique allele (Kimura and Crow 1964), which allows for the creation of an infinite number of allelic states not already present in the population (Estoup and Cornuet 1999; Anmarkrud et al. 2008). On the other hand, the SMM assumes that there is equal probability of gaining or losing a single repeat unit within the microsatellite region to produce new distinguishable alleles. Unlike the IAM, SMM takes into account mutations back to a previous state (Anmarkrud et al. 2008). The SMM assumes that mutations result in alleles that have similar repeat units to the alleles from which they were derived. It further assumes that the differences in repeat units are informative in regard to the amount of time that has passed since the two alleles shared a common ancestor. Genetic distances based on the IAM, however, ignore this information (Goldstein et al. 1995).
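The contrast between the two models can be sketched for a single pair of alleles. Under an IAM-style measure only identity matters, whereas under an SMM-style measure the number of repeat units separating the alleles carries information. The squared difference used below follows the general idea of the SMM-based distances cited above, but the functions iam_distance and smm_distance are simplified, hypothetical per-allele illustrations and not the population-level statistics used in practice.

```python
def iam_distance(size_a, size_b):
    """IAM-style contrast: alleles are either identical in state (0) or not (1)."""
    return 0 if size_a == size_b else 1

def smm_distance(size_a, size_b, unit_bp):
    """SMM-style contrast: squared number of repeat units separating two alleles."""
    steps, remainder = divmod(abs(size_a - size_b), unit_bp)
    if remainder:
        raise ValueError("size difference is not a whole number of repeat units")
    return steps ** 2

# Allele sizes (bp) taken from the trinucleotide locus CCT01.
for a, b in [(155, 158), (155, 164)]:
    print(a, b, iam_distance(a, b), smm_distance(a, b, unit_bp=3))
```

Both pairs differ by the same amount under the IAM-style measure, while the SMM-style measure separates the 155/164 pair (three steps, distance 9) from the 155/158 pair (one step, distance 1).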

Currently, the available data and simulations seem to suggest that most microsatellite sequences change in a stepwise manner (Valdes et al. 1993; Shriver et al. 1995; Zhu et al. 2000; Estoup et al. 2002). However, microsatellite alleles in maize do not always change in a stepwise manner. Most of the polymorphism in allele sizes detected was due to indels in the regions that flank the microsatellite repeat (Matsuoka et al. 2002). Twenty Arabidopsis microsatellite loci were evaluated and found to have complex mutational patterns that did not fit either the SMM or IAM consistently (Symonds and Lloyd 2003). Allelic size variation was also investigated in an intergeneric study of puma (Puma concolor) and the domesticated cat (Felis catus). This study showed that 80% of comigrating alleles between these two species displayed size homoplasy. The sequence differences between alleles of homologous puma and domestic cat microsatellite loci raised doubts about the accuracy of microsatellite-based phylogenetic comparisons between these distantly related mammalian genera (Culver et al. 2001). Overall, several exceptions to a strict SMM have been reported in which more complex mutations occurred among alleles, which altered the sequence content of alleles that were identical in state (Chen et al. 2002; Curtu et al. 2004; Hua et al. 2006; Lia et al. 2007).

Microsatellite markers have been the most commonly utilized markers in molecular biology for mapping, genetic diversity, phylogenetic construction, and fingerprinting because they are codominant, highly polymorphic, and easy to use. Frequently, these markers are developed from sequences containing repeat elements discovered in a particular species of interest, but employed across multiple related species or genera. Chen et al. (2002) demonstrated that allele size is an adequate measure of genetic difference when working with plants that are very closely related. However, when phylogenetic or evolutionary inferences are employed with distantly related species, then evaluation and verification of the SSR allele via sequencing is necessary because hidden motifs in alleles that are identical in state have been detected.

In 2006, the genetic diversity and phylogenetic relationships of multiple Citrus species and two related genera using a set of 24 microsatellite markers derived from C. maxima were assessed (Barkley et al. 2006). Therefore, the scope of this work was to verify the sequence content of citrus microsatellite alleles derived from 11 different species in three separate genera to examine the nature of variability among different-sized alleles, evaluate the extent of homoplasy among citrus SSR alleles in order to assess their utility for measuring phylogenetic relationships, and assess if the repeat motif is retained when using SSR markers over a broad range of taxonomically divergent Citrus species and related genera.

Materials and methods

Allele and taxa selection

Citrus and its close relatives are represented by 28 genera in the tribe Citreae of the subfamily Aurantioideae in the family Rutaceae (Swingle and Reece 1967). There are two commonly used classifications of Citrus: Swingle (Swingle and Reece 1967), and Tanaka (Tanaka 1977). Swingle lumps species together, recognizing 16 species in the genus Citrus, whereas Tanaka splits species, recognizing 162 Citrus species. The difficulty in classifying Citrus taxa is mainly due to repeated cross-pollination and to adventitious nucellar embryony, which stabilizes and perpetuates hybrid taxa (Scora 1975). Scora (1975) and Barrett and Rhodes (1976) suggested that there are only three “basic” true species of Citrus within the subgenus Citrus as defined by Swingle: citron (C. medica L.), mandarin (C. reticulata Blanco), and pummelo (C. maxima L. Osbeck). Nearly all Citrus species freely hybridize with one another, and thus, Mabberley (1997) suggests that taxonomic rank has been inflated due to the commercial importance of this crop and that only three species (C. medica, C. maxima, and C. reticulata) should be recognized for the subgenus Citrus. Other cultivated Citrus species within the subgenus Citrus are believed to be hybrids derived from these true species, species of the subgenus Papeda, or closely related genera, ideas generally supported by molecular marker data (Federici et al. 1998; Nicolosi et al. 2000; Barkley et al. 2006).

Given the taxonomy and prevailing theory on many Citrus species being derived by natural hybridization, alleles were carefully chosen from the data set of Barkley et al. (2006), which examined genetic diversity in a population of 370 citrus accessions and its relatives using SSR markers. Thirty-nine alleles were sampled from accessions considered to be “true” ancestral species within the subgenus Citrus [including C. medica (n = 9), C. maxima (n = 17), and C. reticulata (n = 13)], 5 alleles sampled from the subgenus Papeda, and 18 alleles sampled from their closest relatives, Poncirus trifoliata [trifoliates (n = 10)] and Fortunella spp. [kumquats (n = 8)] (Table 1). The remaining three samples were derived from hybrid taxa. The criterion for choosing taxa containing a particular allele was to maximize the number of different ancestral taxa (species) containing the allele of interest when possible. The citrus relatives were included to help evaluate if the repeat motif was conserved when crossing what is assumed to be more distant taxonomic borders. Additionally, alleles were selected that ranged from very low to high frequency (0.0108–0.8469) in the population (Barkley et al. 2006) to evaluate if allelic richness had any influence on intra-allelic variation (Table 2). In general, we did not sample known and probable hybrid taxa among naturally occurring forms since the goal of this study was to compare alleles derived from ancestral taxa. Thus, alleles chosen in this study were selected because they occurred frequently in putative ancestral taxa.

Table 1 A list of accessions used in this study along with their respective allele sizes that were cloned and sequenced from markers CCT01, cAGG9, and GT03
Table 2 Number of single-site polymorphisms observed at each allele size class

PCR, cloning, and sequencing of SSR alleles

The three loci used for this study were cAGG9, CCT01, and GT03, which had 6, 7, and 19 alleles, respectively, with polymorphic information content (PIC) values of 0.478, 0.247, and 0.834 in a study of 370 Citrus, Poncirus, and Fortunella accessions. PCR and gel electrophoresis conditions were performed as described previously (Barkley et al. 2006). All SSR alleles were cloned following the instructions in TOPO TA Cloning kit from Invitrogen (Carlsbad, CA). The ligation reaction consisted of 2 μl of PCR product, 0.5 μl of 1.2 M NaCl, and 0.5 μl of plasmid at a concentration of 10 ng/μl. The plasmid used was pCR 2.1-TOPO, which was provided by Invitrogen in the cloning kit, and contains Topoisomerase I from Vaccinia virus covalently bound to the vector that catalyzes the ligation of the PCR product into the vector. Chemically competent E. coli cells were transformed by adding the entire ligated product (3 μl) to TOP10 cells provided by Invitrogen. The cells were spread onto pre-warmed LB agar plates (1% tryptone, 0.5% yeast extract, 1% NaCl, 1.5% agar adjusted to pH 7.0) containing 40 mg/ml X-gal (scorable marker) and 50 μg/ml of kanamycin (selectable marker) and incubated overnight (12–16 h) at 37°C to allow colonies to develop.
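For readers unfamiliar with PIC, the sketch below shows how such a value can be computed from allele frequencies. It assumes the commonly used formulation PIC = 1 - sum(p_i^2) - sum over i<j of 2 p_i^2 p_j^2; the formula applied in the cited study is not stated here, and the frequencies in the example are invented.

```python
from itertools import combinations

def pic(freqs):
    """Polymorphic information content from a list of allele frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    # Expected homozygosity term.
    homozygosity = sum(p * p for p in freqs)
    # Correction for matings that are uninformative despite heterozygosity.
    correction = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - correction

# Invented frequencies: one common allele and several rare ones give a low PIC,
# whereas many alleles of similar frequency give a high PIC.
print(round(pic([0.85, 0.05, 0.05, 0.05]), 3))
print(round(pic([0.2, 0.2, 0.2, 0.2, 0.2]), 3))
```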

Colonies were screened visually for a lack of color. Ten white colonies per plate were screened for the presence of an SSR allele by amplifying the plasmid with M13 primers included in the cloning kit. The PCR consisted of 5.4 μl of dH2O, PCR buffer (1×), magnesium chloride (1 mM), dNTPs (0.2 mM), M13F and M13R (7.5 ng/μl), and a scraping of cells from a single colony. The thermocycling conditions included a 2-min denaturing step at 92°C for 1 cycle; 30 cycles of 92°C for 30 s, 52°C for 30 s, and 72°C for 1 min; and a final elongation cycle of 72°C for 7 min. The PCR products were separated on a 4% precast agarose E-gel (Invitrogen; Carlsbad, CA) and scored visually for the presence of an insert. Two size standards were run on each E-gel to determine the insert sizes (pGEM Promega; Madison, WI and 100-bp marker, Invitrogen). Positive colonies were grown in 3 ml of liquid LB media (1% tryptone, 0.5% yeast extract, and 1% NaCl adjusted to pH 7.0) overnight, and the plasmids were isolated following the instructions from a Qiagen (Valencia, CA) mini-prep kit. Plasmids were sequenced bidirectionally at the University of California, Riverside, Genomics Institute Core Instrumentation Facility using an ABI 3100 DNA sequencer (16-capillary). Multiple clones of a single allele were sequenced for over 38% of the selected alleles in this study to evaluate PCR and sequencing errors.

Sequence alignments and tree construction

AlignIR 2.0 (LI-COR; Lincoln, NE) was used to trim out the vector sequence and construct a consensus sequence of the forward and reverse sequence reads. No discrepancies between bidirectional sequences were noted in the insert region. All sequences were aligned using ClustalX (Thompson et al. 1997). Alignments were performed with low and high gap penalties. The setting used for the pairwise alignment parameter was 10.00 for the gap opening and 0.10 for the gap extension penalty. The multiple alignment parameters were set to have a gap opening of 10.00 and a gap extension penalty of 0.20. The pairwise alignment parameter was repeated using 100 for the gap opening and 7.5 for the gap extension penalties. The multiple alignment parameters were increased to 100 for the gap opening and 3.0 for a gap extension penalty. Changing the gap penalty parameters did not affect the sequence alignment. In the microsatellite region, the ClustalX alignment was also visually inspected and manually edited to minimize the number of gap locations.

Unweighted parsimony analysis with PAUP version 4.0 beta 10 was used to construct phylogenetic trees for each SSR marker (Swofford 2003). Parsimony searches all possible trees and evaluates each tree for the minimum number of mutations. This analysis performs well when convergence is rare, sampling is dense, and individual branches are short (Holder and Lewis 2003). Bootstrap analysis, which tests clade stability by resampling the data with replacement, was conducted with 10,000 replicates. All gaps in the sequence were treated as a fifth base as opposed to being treated as missing. The sequence data were edited to reflect a 1-bp change for each respective trinucleotide or dinucleotide gap that occurred. This change ensures that a di- or trinucleotide gap in the repeat element is not treated as multiple characters/events, which would over-inflate the number of informative characters in each sequence. Additionally, gene genealogies (networks) were constructed from the sequence data (nexus format) of all the alleles at each marker by utilizing TCS version 1.13 (Clement et al. 2000). All gaps were treated as a fifth base. Sequence alignments were imported into DnaSP (DNA sequence polymorphism) version 3.5 (Rozas and Rozas 1999) to calculate statistics on DNA sequence variation such as π and θ for the three SSR markers studied.
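The two summary statistics named above can be illustrated with a minimal sketch. It assumes that π is the mean number of pairwise differences per site and that θ is the Watterson estimator based on the number of segregating sites; dropping every column that contains a gap is a further assumption made for this example. The function name diversity_stats and the toy sequences are invented, and the code is not a reimplementation of DnaSP.

```python
from itertools import combinations

def diversity_stats(aligned):
    """Return (pi, theta_w) per site for equal-length aligned sequences."""
    n = len(aligned)
    if n < 2:
        raise ValueError("need at least two sequences")
    # Assumption: alignment columns containing a gap are excluded.
    columns = [col for col in zip(*aligned) if "-" not in col]
    length = len(columns)
    if length == 0:
        raise ValueError("no gap-free columns to analyse")
    # S: number of segregating (variable) sites.
    segregating = sum(1 for col in columns if len(set(col)) > 1)
    # pi: mean number of differences over all sequence pairs, per site.
    sequences = list(zip(*columns))
    pairs = list(combinations(sequences, 2))
    mean_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    pi = mean_diffs / length
    # theta_w: S divided by the harmonic number a_n, per site.
    a_n = sum(1.0 / i for i in range(1, n))
    theta_w = segregating / a_n / length
    return pi, theta_w

# Invented toy alignment of four sequences.
toy = ["ACGTACGT", "ACGTACGT", "ACCTACGT", "ACG-ACGA"]
pi, theta_w = diversity_stats(toy)
print(round(pi, 4), round(theta_w, 4))
```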

Results

A total of 65 alleles mainly derived from ancestral Citrus species and two closely related genera (Poncirus and Fortunella) were cloned and sequenced from three SSR markers (Table 1). The relatives of Citrus were included to examine how often the microsatellite is conserved when crossing distant taxonomic borders. Even though Poncirus and Fortunella species can be hybridized with the genus Citrus, there is little evidence to suggest a long history of natural gene exchange between the genera Poncirus and Citrus. Furthermore, previous phylogenetic data based on molecular markers demonstrate that Poncirus, Fortunella, and Citrus are divergent (Nicolosi et al. 2000; Barkley et al. 2006; Pang et al. 2007), although cpDNA sequences place C. medica as more distant from other Citrus spp. than these related genera (Bayer et al. 2009). The alleles sequenced in this study were amplified using primers that targeted two trinucleotide repeat loci, one compound and one imperfect (cAGG9 and CCT01), and one imperfect dinucleotide repeat locus (GT03). The collected sequence data for each marker were used to create dendrograms to evaluate inter- and intra-allelic relationships via parsimony. Dendrograms were constructed utilizing the entire sequence (microsatellite region and flanking region) and the flanking region alone to evaluate if the repeat motif or the flanking region contributed in determining evolutionary relationships among alleles.

Locus CCT01

A total of 25 microsatellite alleles were cloned and sequenced from the trinucleotide locus CCT01 (Table 1). The indels observed in these sequences consisted of 3-bp repeats and occurred only within the microsatellite region. Size homoplasy in which alleles are identical in state but not identical by descent was detected in the alleles sequenced. For example, the 158-bp allele from CRC 644 pummelo had a compound repeat motif of (TCC)3(ACC)2(TCC)2 while the remaining 158-bp alleles had a slightly different imperfect repeat motif consisting of (TCC)3ACC(TCC)3. The 158-bp allele from CRC 644 pummelo could have either arisen from another 158-bp allele by a T-to-A point mutation or from a 164-bp allele by deletions of TCC from both sides of the ACC interrupt. Most of the 161-bp alleles had a repeat motif of (TCC)3(ACC)(TCC)4 followed by TCT, but one had (TCC)3(ACC)(TCC)3(TCT)2, which may have arisen by a C-to-T mutation, or by loss of a TCC and gain of a TCT, creating a new repeat motif. The numerous single-site mutations at this locus were the basis of several cases of apparent homoplasy.
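Size homoplasy of this kind can be checked mechanically by expanding the compact repeat notation into bases. The sketch below does this for the two 158-bp repeat motifs reported above; the helper expand_motif is a hypothetical name introduced for illustration and handles only the simple notation used in this section.

```python
import re

def expand_motif(motif):
    """Expand compact repeat notation such as '(TCC)3ACC(TCC)3' into bases."""
    parts = []
    # Each token is either '(UNIT)n' (n defaults to 1) or a bare run of bases.
    for unit, count, bare in re.findall(r"\(([ACGT]+)\)(\d*)|([ACGT]+)", motif):
        parts.append(bare if bare else unit * int(count or 1))
    return "".join(parts)

crc644 = expand_motif("(TCC)3(ACC)2(TCC)2")   # 158-bp allele, CRC 644 pummelo
others = expand_motif("(TCC)3ACC(TCC)3")      # remaining 158-bp alleles

mismatches = sum(a != b for a, b in zip(crc644, others))
print(len(crc644), len(others), mismatches)  # 21 21 1
```

The two repeat regions are identical in length (21 bp) but differ at one base, which is consistent with the T-to-A point mutation proposed above as one possible origin of the CRC 644 allele.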

The sequence data were employed to generate a gene tree that recognized eight characters that were parsimony informative and 17 variable characters that were uninformative. The tree (Fig. 1a) that resulted was not well resolved. The alleles in this tree did not segregate into several clusters of alleles of the same size class as would be expected for a microsatellite locus in which variation was due solely to change in repeat length. For example, the main polytomy of this tree included taxa with 155-, 158-, 161-, and 164-bp alleles, which clustered together even though there were differing numbers of trinucleotide gaps in these sequences. It is possible that a 9-bp gap between the smallest (155 bp) and the largest (164 bp) alleles, which was treated by PAUP as three mutational steps to reflect the trinucleotide repeat and very few informative characters, was not an adequate difference to sufficiently separate these four alleles.
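The distinction between parsimony-informative and uninformative variable characters can be sketched as follows. A site is counted as informative when at least two character states each occur in at least two sequences; gaps are treated as a fifth state, as in the analysis described in the methods. The function classify_sites and the toy alignment are invented for illustration and do not reproduce the PAUP output.

```python
from collections import Counter

def classify_sites(aligned):
    """Return (informative, variable_uninformative) counts for an alignment."""
    informative = uninformative = 0
    for column in zip(*aligned):
        counts = Counter(column)  # '-' is counted as a fifth character state
        if len(counts) < 2:
            continue  # constant site
        states_seen_twice = sum(1 for c in counts.values() if c >= 2)
        if states_seen_twice >= 2:
            informative += 1
        else:
            uninformative += 1
    return informative, uninformative

# Invented toy alignment: columns 1 and 5 are informative, column 3 is a singleton.
toy = ["ACGTA-", "ACGTA-", "ATGCAT", "ATGTAT"]
print(classify_sites(toy))  # (2, 1)
```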

Fig. 1

a Strict consensus tree produced from the sequence data of alleles (155, 158, 161, and 164) from marker CCT01. b Strict consensus tree produced from the sequence data of alleles (103, 112, 115, and 118) from marker cAGG9. c Strict consensus tree produced from the sequence data of alleles (151, 153, 167, 171, and 173) from marker GT03. d Strict consensus tree produced from only the flanking sequences of microsatellite alleles derived from marker GT03. The names on the termini of all branches include allele size, CRC number, cultivar name, and taxonomic group, respectively

Locus cAGG9

Twenty-one microsatellite alleles amplified from locus cAGG9 were cloned and sequenced (Table 1). The interallelic size variation observed at this locus was due to indels within the microsatellite repeat. Very few cases of apparent homoplasy were observed at this locus. A tree was constructed from the sequence data obtained from this marker (Fig. 1b). Only six informative characters were identified, indicating very little sequence divergence, which could be a result of the high degree of cross hybridization and stabilization of hybrids via nucellar embryony among Citrus. However, since hybridization and nucellar embryony would apply equally to all loci examined and this limited sequence divergence was not observed in other loci, this may suggest that this locus is affected by stabilizing selection. Most of the sequence divergence detected appeared as trinucleotide gaps in the microsatellite region as indicated by the decrease in the number of parsimony informative characters (from six to one) when the microsatellite region was removed from the data set (data not shown). All of the alleles of the same size class clustered together and were unresolved, indicating that these allele sizes specify evolutionary relationships at this marker. All of the 118-bp alleles clustered together and could not be resolved. These alleles were derived from two trifoliate orange accessions (P. trifoliata), ancestral Citrus taxa including a mandarin (C. reticulata), and two pummelos (C. maxima), which are divergent based on taxonomic classification and phylogenies produced from molecular marker data (Federici et al. 1998; Nicolosi et al. 2000; Barkley et al. 2006). The sequence for this allele was completely conserved with no single-site polymorphism observed among these divergent taxa. This sequence conservation suggests that this allele may be ancestral, and thus, was present before the genera Poncirus and Citrus separated. Another, less likely possibility is that these distantly related taxa evolved the same derived characters independently.

The taxa with the 115-bp allele were chosen for this study because they were ancestral and classified in three separate species in the genus Citrus, and therefore, are taxonomically divergent. Accessions derived from these ancestral species separate into distinct clades in previous studies using SSR, RFLP, AFLP, or RAPD markers (Federici et al. 1998; Nicolosi et al. 2000; Barkley et al. 2006; Pang et al. 2007). The 115-bp sequences were fairly conserved with only three single-site polymorphisms observed within this allele class (Table 2). The next group, consisting of the five 112-bp alleles, was also unresolved in this tree and supported with a bootstrap value of 64%. Four of the five taxa with a 112-bp allele had no detectable polymorphisms when compared to one another. The last main group was the 103-bp alleles. All of the 103-bp alleles shared a transversion in the flanking region compared to the remaining allele size classes. The branch supporting these alleles was highly supported with a bootstrap value of 94% (Fig. 1b).

The microsatellite region was removed from all alleles produced at this locus to examine the effect of the polymorphisms in the flanking region and to determine how they influence the resolution of the tree. The resulting tree was much less resolved, and most of the alleles (112, 115, and 118) could not be distinguished from one another (data not shown). Since the resulting tree was less resolved than the tree with the entire sequence, this suggests that the allele sizes at this marker do indicate evolutionary relationships between sequences when the entire sequence is employed. However, in four of the five 103-bp alleles, the polymorphisms in the flanking region played a role in inferring the evolutionary relationships, since this allele segregated from the others examined due to a transversion observed in the flanking sequence. Moreover, this analysis demonstrates that contrary to other studies such as Rossetto et al. (2002), utilization of the flanking sequence for this locus would not be an effective strategy to deduce evolutionary relationships in citrus taxa.

Locus GT03

Nineteen alleles from locus GT03 were cloned and sequenced (Table 1). No indels were found in the regions flanking the microsatellite. However, there were several point mutations in these sequences both in the microsatellite region and the flanking sequence that produced multiple cases of homoplasy. Several of the point mutations detected in the alleles of this locus were genus/species specific or specific to a particular allele class. A dendrogram was constructed from the sequence data at this marker (Fig. 1c). This marker displayed the highest number of parsimony-informative characters with 18 informative characters. Many alleles of the same size clustered together and could not be distinguished. There were four main groupings in this tree, each consisting of a different allele size class (173, 171, 167, 153).

All of the 173-bp alleles, which occurred in the Citrus relative Poncirus trifoliata, were unresolved. The 171-bp alleles split into two groups. One cluster contained the 171-bp alleles produced from accessions classified in the genus Poncirus while the other cluster contained 171-bp alleles produced from accessions classified in the genus Citrus. This split would be expected based on taxonomic classification and phylogenies based on molecular marker data. The two 171-bp alleles from Poncirus accessions shared a C-to-T transition that was also observed in all of the 173-bp Poncirus alleles. This transition was genus/species specific, occurring only in P. trifoliata. As a group, the trifoliate oranges all tend to be similar to one another (Fang et al. 1997) and are divergent from the genus Citrus (Nicolosi et al. 2000; Barkley et al. 2006; Pang et al. 2007). This may explain why these alleles of different sizes share the same point mutation in the flanking sequence. Additionally, the 171-bp alleles displayed homoplasy because the 171-bp alleles produced from P. trifoliata accessions have an imperfect (GT)3TTCT(GT)14 repeat motif, whereas the 171-bp alleles derived from the genus Citrus have a compound (GT)3TT(CT)2(GT)13 repeat motif.

The 167-bp alleles were derived from four kumquats (two Fortunella hindsii and two F. crassifolia) and one pummelo (Citrus maxima). Transition mutations were detected within the microsatellite and flanking regions, respectively, that were specific to the genus Fortunella (kumquat), but not observed in the 167-bp allele from a pummelo (C. maxima). This produced two different microsatellite repeat motifs for the 167-bp alleles: (GT)3TTCT(GT)12 in pummelo and (GT)3TTCT(GT)3AT(GT)8 in kumquats. The four kumquat accessions had no detectable sequence divergence; therefore these alleles clustered together with a bootstrap value of 68% and were completely unresolved (Fig. 1c). This divergence caused the pummelo accession with a 167-bp allele to cluster separately from the other 167-bp alleles. However, pummelo (C. maxima) accessions are classified in a different genus than the kumquats (Fortunella); therefore one might expect this based on the taxonomy. Additionally, C. maxima and Fortunella spp. have typically clustered in separate clades in previous molecular marker studies (Federici et al. 1998; Nicolosi et al. 2000; Barkley et al. 2006).

The last main group on this dendrogram consisted of three 153-bp alleles and one 151-bp allele produced from citron (C. medica), mandarin (C. reticulata), and kumquat (F. hindsii) accessions. The three 153-bp alleles clustered together and were unresolved. The bootstrap value for this group was highly supported with a value of 100% (Fig. 1c). Homoplasy also was observed in the 153-bp alleles. The repeat motifs were different for each 153-bp allele with repeat motifs of (GT)3ATCT(GT)3(AT)2, (GT)3TTCT(GT)3(AT)2, and (GT)3TTCT(GT)4AT. The 153-bp alleles had a few point mutations (C-to-G and A-to-C transversions in the flanking region, G-to-A transition in the microsatellite) that were specific to this allele size class and occurred in two separate genera and species (Fortunella hindsii and C. reticulata). In a population structure analysis (Barkley et al. 2006), CRC 2867 C. reticulata was found to be a hybrid having approximately 60% of its alleles derived from kumquats and only 40% from mandarins, which may explain why these accessions classified in separate genera share allele-specific mutations.

The 151-bp allele clustered with the 153-bp alleles (Fig. 1c). This 151-bp allele derived from ‘Japansche’ (CRC 2875) has lost the TTCT interrupt contained within the microsatellite that all other taxa share and has more GT repeats than the 153-bp alleles. This 151-bp allele demonstrates an exception to the stepwise mutation model. One would expect that a dinucleotide repeat microsatellite locus evolving in a purely stepwise manner would contain a 2-bp deletion of a repeat motif when comparing a 153-bp allele to a 151-bp allele. However, it is also possible that the 151-bp allele is the ancestral allele and all the other alleles gained the TTCT interrupt contained within the microsatellite.

The allele sequences were edited to remove the microsatellite to examine the influence of the flanking sequence on the evolutionary relationships among taxa (Fig. 1d). Once again, the number of parsimony informative sites was drastically reduced, from 18 to 5, suggesting that the majority of the variation is contained within the repeat motif. Even though removing the microsatellite region reduced the number of parsimony informative sites, the resulting tree was fairly similar to the tree obtained with the entire sequence in which alleles of the same size class clustered together. All of the 173-bp alleles (all P. trifoliata taxa) and two of the 171-bp alleles (P. trifoliata taxa) clustered together and were indistinguishable due to a shared parsimony-informative site (C-to-T) in the flanking sequence, which would be expected based on taxonomy and phylogenies based on molecular marker data (Nicolosi et al. 2000; Barkley et al. 2006; Pang et al. 2007). Four of the five 167-bp alleles derived from the citrus relatives clustered together and were unresolved. The other 167-bp allele from a pummelo (C. maxima) accession clustered with the remaining 171-bp alleles produced from ancestral Citrus taxa. Since the overall clustering pattern was similar to the tree produced with the entire sequence, this would imply that the flanking sequence polymorphisms significantly contributed to the evolutionary relationships between these allele sequences. This further suggests that the flanking sequence of GT03 could be used effectively as a marker to deduce evolutionary relationships between taxa.

Networks/gene genealogies

TCS version 1.13 (Clement et al. 2000) was employed to produce a gene genealogy or network for each marker. This program reduces the data set by eliminating duplicate haplotypes and calculates the probability of parsimony for all pairwise combinations until the probability surpasses 0.95. The output from TCS was used to diagram a gene genealogy for each marker (Fig. 2a–c). The alleles used in this study were not randomly sampled, and thus, the ancestral allele chosen by TCS probably does not represent an actual ancestral allele. Locus CCT01 had the most complex network (Fig. 2a) with relatively few duplicate haplotypes. Only 4 alleles out of 25 had identical haplotypes at locus CCT01. The 158-bp allele produced from ‘Indian’ citron CRC 661 was picked as the haplotype from which the other haplotypes could be derived with the least amount of change. This analysis also suggests that the 158-bp alleles from CRC 644 pummelo may not have arisen in a stepwise manner, but instead gained and lost a repeat element from the 158-bp ancestral haplotype CRC 661 or possibly lost two repeat units from the 164-bp allele. Locus cAGG9 had a fairly simple network (Fig. 2b), probably due to the comparatively few point mutations observed. In contrast to marker CCT01, many of the alleles of the same size class had identical sequences, and thus, were reduced to a single haplotype such as the 118-bp allele. Even though locus GT03 had many cases of homoplasy and numerous parsimony informative sites (discussed previously), the network for GT03 was also fairly simple and several alleles of the same size class were reduced to a single haplotype (Fig. 2c).
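The first step described above, the elimination of duplicate haplotypes, can be sketched in a few lines. The example mirrors only that reduction step and not the parsimony probability calculation performed by TCS; the function collapse_haplotypes, the allele labels, and the toy sequences are hypothetical.

```python
from collections import defaultdict

def collapse_haplotypes(alleles):
    """Group allele labels that share an identical sequence.

    `alleles` maps a label to its aligned sequence; the result is a list of
    label groups, one group per distinct haplotype, in order of first appearance.
    """
    groups = defaultdict(list)
    for label, sequence in alleles.items():
        groups[sequence.upper()].append(label)
    return list(groups.values())

# Invented toy data: three alleles of one size class share a sequence.
toy = {
    "118_A": "ACGTTCC",
    "118_B": "ACGTTCC",
    "115_C": "ACGTTCA",
    "118_D": "acgttcc",
}
print(collapse_haplotypes(toy))  # [['118_A', '118_B', '118_D'], ['115_C']]
```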

Fig. 2

a CCT01 network/gene genealogy constructed from TCS 1.13 (Clement et al. 2000). b cAGG9 network/gene genealogy. c GT03 network/gene genealogy. The branches are labeled to denote the changes between alleles, and the arrows denote the direction of the change shown on the diagram. Each allele size is denoted by a different shape (e.g., 158 bp = oval). The text within the shapes denotes the allele size, CRC number, and a letter (C citron, K kumquat, M mandarin, P pummelo, Pa papeda, R rangpur, and T trifoliate) to denote the group in which the particular CRC number is classified. The order of mutations denoted on branches is arbitrary

Single-site polymorphism

The total number of single-site polymorphisms was calculated by examining the number of point mutations that occurred in each allele size class (Table 2). The 151-bp allele at locus GT03 was not included because only one 151-bp allele was cloned. Locus CCT01 had the most single-site polymorphisms with a total of 22 among the four allele size classes. On the other hand, locus cAGG9 had the fewest total single-site polymorphisms with seven observed for all four alleles. The 118-bp allele produced by primers targeting locus cAGG9 was the only allele in which no single-site polymorphisms were observed, indicating high sequence conservation. This allele also had a relatively low frequency of 0.038 in a population of 370 citrus accessions (Table 2), which may be why this sequence was completely conserved. The 158-bp allele at locus CCT01 had 10 single-site polymorphisms. This was the most observed in any allele size class and it also had the highest allele frequency (0.8469) of all the alleles targeted, suggesting that alleles that are common in the population may have a higher substitution rate or represent more ancient alleles. For loci CCT01 and cAGG9, the number of single-site mutations generally increased with allele frequency (Fig. 3). However, more alleles of different frequencies would need to be included to validate this trend. Single-site polymorphisms were also calculated for each locus examining all the sequence data collected per marker as opposed to evaluating each allele size class. The total number of single-site polymorphisms for all three loci was 30 with 16, 7, and 6 single-site polymorphisms observed at markers CCT01, GT03, and cAGG9, respectively.

Fig. 3

Scatter plot showing the general relationship of allele frequency (x-axis) compared to the number of single-site mutations (y-axis) for each of the three markers used in this study

Frequency of gaps and substitutions

The frequencies of gaps and substitutions were compared for all of the alleles at each of the three microsatellite loci examined to determine whether gaps or base substitutions were more frequent. The alleles of marker cAGG9 had very few base substitutions (11) in comparison to the alleles produced from markers GT03 and CCT01, which had 40 and 34 base substitutions, respectively. The corresponding frequencies of base substitutions were 0.46, 0.83, and 1.26 per 100 bp for cAGG9, CCT01, and GT03, respectively. The mean frequency of base substitutions for all three loci was 0.85 per 100 bp. The numbers of trinucleotide gaps observed in the various sized alleles for markers cAGG9 and CCT01 were 41 and 37, respectively. Marker GT03 had a total of 61 dinucleotide gaps for all the alleles targeted at this locus. Given the total number of bases sequenced at marker cAGG9, gaps occurred 3.73 times as frequently as base substitutions. However, for markers GT03 and CCT01, gaps occurred only 1.53 and 1.1 times as frequently as base substitutions, respectively. It therefore appears that insertion or deletion of repeat elements occurs more frequently than base substitution for the alleles at these three microsatellite loci, assuming that changes in allele size are due to the addition or deletion of one repeat element at a time.
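The ratios reported above can be checked directly from the counts. Within a locus, gaps and substitutions are both expressed per the same number of sequenced bases, so that denominator cancels and the ratio of frequencies equals the ratio of raw counts. The following is a minimal sketch of that arithmetic; the variable and function names are illustrative, and half-up decimal rounding is assumed.

```python
from decimal import Decimal, ROUND_HALF_UP

# Counts taken from the text above.
substitutions = {"cAGG9": 11, "GT03": 40, "CCT01": 34}
gaps = {"cAGG9": 41, "GT03": 61, "CCT01": 37}

# Base substitutions per 100 bp, as reported in the text.
substitutions_per_100bp = {
    "cAGG9": Decimal("0.46"),
    "CCT01": Decimal("0.83"),
    "GT03": Decimal("1.26"),
}


def round2(value):
    """Round a Decimal to two places, rounding halves up."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Gap-to-substitution ratio per locus (the per-base denominator cancels).
ratios = {
    locus: round2(Decimal(gaps[locus]) / Decimal(substitutions[locus]))
    for locus in gaps
}

# Mean substitution frequency across the three loci.
mean_frequency = round2(
    sum(substitutions_per_100bp.values()) / len(substitutions_per_100bp)
)

print(ratios)
print(mean_frequency)
```

The computed CCT01 ratio of 1.09 is consistent with the value of 1.1 given in the text at one decimal place.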

Sequence variation

DNA polymorphism from nucleotide sequence data generated from the three microsatellite markers was examined by using the program DnaSP version 3.5 (Rozas and Rozas 1999). The average number of different nucleotides per site between sequences (π) ranged from 0.00730 for marker cAGG9 to 0.01624 for marker CCT01 with a mean value of 0.01324. The nucleotide diversity (π) for marker GT03 was 0.01618. The average number of nucleotide differences (K) for cAGG9, GT03, and CCT01 was 0.781, 2.281, and 2.533, respectively. Theta (θ) was calculated per site from the number of polymorphic sites and also calculated per site from the total number of mutations. The values of θ calculated per site from the number of segregating sites for markers cAGG9, GT03, and CCT01 were 0.01299, 0.02029, and 0.03565, respectively. The values of θ per site estimated from the total number of mutations were 0.01559, 0.02029, and 0.03565 for cAGG9, GT03, and CCT01, respectively. In general, these values calculated for sequence variation can only be considered estimates due to the selection of samples based on allele frequency and maximizing ancestral taxa, which included multiple genera, species, and a few hybrids. Further work could include analyzing multiple individuals from a single taxon to obtain precise measures of sequence variation.
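For readers unfamiliar with these statistics, the sketch below shows how K, π, and a per-site θ based on segregating sites are commonly computed from an alignment. It is an illustration under stated assumptions, not DnaSP's code: it assumes a gap-free alignment of equal-length sequences, takes θ from segregating sites to be Watterson's estimator, and uses hypothetical toy sequences rather than the data from this study.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Count the sites at which two aligned sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def nucleotide_diversity(sequences):
    """Return (K, pi) for a gap-free alignment.

    K is the average number of nucleotide differences over all pairs of
    sequences; pi is K divided by the alignment length.
    """
    length = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    k = sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)
    return k, k / length


def watterson_theta_per_site(sequences):
    """Return theta per site estimated from the number of segregating sites."""
    n = len(sequences)
    length = len(sequences[0])
    # A site is segregating if more than one nucleotide is observed there.
    segregating = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    # Harmonic correction for sample size: a_n = sum of 1/i for i in 1..n-1.
    a_n = sum(1 / i for i in range(1, n))
    return segregating / (a_n * length)


# Hypothetical toy alignment: 4 sequences, 10 sites, 2 segregating sites.
toy_alignment = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC", "ACGAACGTTC"]

k, pi = nucleotide_diversity(toy_alignment)
theta = watterson_theta_per_site(toy_alignment)
print(k, pi, theta)
```

Because π weights each variant by how often it is observed in pairwise comparisons while θ from segregating sites counts each variable site once, the two can differ when many variants are rare, which is one reason both are usually reported.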

Discussion

Microsatellite markers are frequently used in molecular genetic studies because they are codominant, polymorphic, and ubiquitous in eukaryotic genomes. These markers have been extensively used for assessing genetic diversity, fingerprinting, determining parentage, forensics, construction of genetic linkage maps, and phylogenetic analysis. SSR markers can be effective tools for almost all of these research objectives; however, due to extensive homoplasy, they may fail or lead to incorrect conclusions in phylogenetic analysis when evaluating divergent intraspecific, interspecific, or intergeneric relationships. In principle, homoplasy is tightly linked to the mechanisms that cause mutations that produce new alleles, and hence, homoplasy is also coupled to the underlying mutation model (Lia et al. 2007). The amount of homoplasy in microsatellite alleles is generally thought to increase with increasing time of divergence among taxa (van Oppen et al. 2000) and with increasing allele size, since longer repeats are less stable than shorter repeats (Anmarkrud et al. 2008). Previous interspecific studies of the sequence content of microsatellite alleles have revealed prevalent homoplasy, including loss of the targeted repeat motif; hence, it has been suggested that these markers may not be useful for phylogenetic analysis above the species level (Ochieng et al. 2007; Tesfaye et al. 2007). However, since genetic distance measures (used for phylogenetic construction) are averaged over several loci, a few cases of homoplasy would probably not invalidate the average relationships between taxa, but the effect or amount of allowable homoplasy in microsatellite alleles for phylogenetic construction is currently unknown.

Genetic distance measures used to construct a phylogeny assume that allelic size class is an indication of phylogenetic affinity (Orti et al. 1997). Furthermore, phylogenetic reconstruction is based on the assumption that mutations between individuals increase as the time increases since they diverged from a common ancestor (Holder and Lewis 2003). Therefore, if microsatellite alleles arise by convergent or parallel evolution in which different lineages acquire the same trait, revert to their ancestral states, or do not contain similar sequence content for alleles that are identical in state, the ability to infer patterns of evolutionary history can be affected (Adams et al. 2004). Measuring homoplasy caused by convergent evolution can be difficult without evaluating mutations in known pedigrees. (Currently, extensive pedigree information in citrus is limited.) On the other hand, homoplastic alleles that are identical in state (IIS), but not identical by descent (IBD), and thus contain different sequences for the same sized alleles, can be easily evaluated by determining the sequence content. Since allelic homoplasy and hidden motifs within alleles have been demonstrated in several microsatellite studies (Grimaldi and Crouau-Roy 1997; Primmer and Ellegren 1998; Viard et al. 1998; Culver et al. 2001; Hale et al. 2004), one needs to be careful in interpreting results from microsatellite data based solely on allele size data, particularly for distantly related taxa.

Because citrus taxonomy can be somewhat debatable and microsatellite markers are suggested to be employed only for intraspecific relationships, our goal in this study was to evaluate what changes might exist in these alleles over what is assumed to be divergent citrus taxa (based on current taxonomy and previous molecular marker studies). The main obstacles in classifying citrus are disagreement on whether hybrids among naturally occurring forms should be assigned species rank (Roose et al. 1995), repeated cross pollination among taxa, and nucellar embryony, which perpetuates hybrid taxa (Scora 1975). The results from sequencing microsatellite alleles in Citrus spp. and its two closest relatives showed that the expected repeat motifs were present, albeit sometimes slightly modified, even when evaluating alleles derived from separate taxonomic genera and species. Since most Citrus taxa can freely cross with one another, conservation of microsatellites among interspecific accessions is not too unexpected, whereas studies of other species and genera have not always observed microsatellite repeat preservation when examining distant taxonomic relatives (Chen et al. 2002). Moreover, the microsatellite alleles generally, but not always, provided information about their relatedness in that same sized alleles clustered together, although they did not always display identity between alleles of the same size class. This suggests that employing microsatellite markers in the genera Citrus, Poncirus, and Fortunella may generate valid phylogenetic inferences when calculating genetic distances using mutation models that assume some homoplasy may occur. [Currently, several mutation models used for the analysis of microsatellite markers to calculate genetic distance such as the stepwise mutation model, K-allele model (Kimura 1968), and the two-phase model (Di Rienzo et al. 1994) assume that some homoplasy may occur; however, the infinite allele model does not take into account that homoplasy may occur (Estoup et al. 2002)]. Additionally, since preservation of microsatellite motifs among distant species or genera is not always typical (Chen et al. 2002), this may suggest that as hypothesized by Mabberley (1997) and Bayer et al. (2009), species and genera rank may be over-inflated due to the commercial value of citrus. However, because homoplasy is thought to increase with time of divergence, it is also possible that the divergence time between species and genera has not been extensive enough to allow mutations to accumulate, and thus, significantly alter the sequence of these microsatellite alleles. Another possible explanation could be that these microsatellites are somewhat stable in distant citrus taxa due to their small size since larger repeat motifs have been shown to have more hidden sequence motifs than shorter alleles (Anmarkrud et al. 2008). Consequently, one may be able to control or reduce the amount of homoplasy in microsatellite alleles by intentionally selecting microsatellites with shorter repeat elements.

In conclusion, this work demonstrates that variation among microsatellite alleles in the genus Citrus and two related genera (Poncirus and Fortunella) was fairly consistent with the stepwise mutation model. With one notable exception, interallelic variation in all of the targeted alleles at these three loci could be explained by an expansion or contraction of repeat units. No indels were detected in the flanking sequence as seen in several previous studies (Orti et al. 1997; Makova et al. 2000; Matsuoka et al. 2002). This work suggests that microsatellites can be a useful tool for evaluating Citrus species and two related genera since repeat motifs were reasonably well retained. Homoplasy was detected at all three loci but was most prevalent in markers GT03 and CCT01; consequently, the number of microsatellite alleles is clearly an underestimate of the number of sequence variants present. This suggests that allele size data do not always represent the true level of genetic diversity present in Citrus and two related genera. In general, as the allele frequency increased in the population, so did the number of single-site mutations, which in turn generated some of the observed homoplasy; however, more work needs to be done with a range of alleles at different frequencies in the population and the inclusion of more markers to validate this trend. In addition, sequencing these alleles revealed new genetic variation that would not have been detected based on size alone, some of which was specific to certain genera or species; with further testing, this variation could be used to develop SNP markers to distinguish individual accessions or a particular species. Overall, this study along with others adds to the growing body of evidence that microsatellite alleles that are similar in size are not necessarily characterized by identical sequence content and do not necessarily contain the expected microsatellite repeats. Thus, careful examination of the sequence content of alleles should be performed prior to making any conclusions about the assumed evolutionary relationships between accessions.