Background

Microsatellites are direct tandem repeats of 1–6 base pair sequence motifs, often strung together in long arrays. They occur much more commonly than expected by chance in the genomes of all eukaryotes [1–3]. The reasons for this are not yet fully understood, but increasing evidence indicates that many microsatellites are functionally important in regulating gene expression [4–10] and possibly also meiotic recombination [11–14]. Microsatellites are also of interest because of their widespread use as genetic markers for applications in genome mapping [15–17], gene hunting [18–20], forensics [21], deducing kinship [22], population genetics [23–25] and the study of the evolution of species [26–28]. These applications depend on assumptions about microsatellite evolution that, at present, are overly simplistic because of unexplained heterogeneity in mutation rates between loci, and an increased understanding of microsatellite evolution and mutational mechanisms is therefore being sought (reviewed in [29, 30]).

Slipped strand mispairing during DNA replication is currently thought to cause most microsatellite mutations [31], but it has also been proposed that unequal meiotic recombination could drive microsatellite evolution [32]. Recombination has been demonstrated to cause instability of some microsatellite loci implicated in human disease (reviewed in [33]), but evidence has counted against it being considered a significant factor in microsatellite evolution. Microsatellite instability was not found to be reduced in recombination deficient strains of E. coli [34] or S. cerevisiae [35] and similar microsatellite mutation rates have been reported for the non-recombining human Y chromosome and the autosomes [36–38]. Also, no association has been found between microsatellite variation and recombination rates on scales of several hundred thousand base pairs in humans [39, 40]. Recent evidence has shown, however, that meiotic recombination events predominantly occur in narrow hotspots of 1–2.5 kilobases (kb) separated by as much as 50–100 kb of DNA that very seldom recombines [41–44]. Data about the relationship between microsatellites and recombination hotspots at this narrow scale are sparse, and there are some signs that it merits further investigation. A poly-AC array inserted near a recombination hotspot in S. cerevisiae mutated with high frequency [12], and it has recently been found that polymorphic microsatellites are over-represented in human hotspots [45]. There is also some evidence that microsatellites could have a role in regulating hotspot recombination [11–14], increasing the relevance of studying their association with hotspots, since the basis in sequence of the control of hotspot locations is not yet well understood [42–44, 46–48].

It has been shown previously that microsatellite frequencies correlate with broad scale recombination rates in rats, mice and humans [49]. Microsatellites are also associated with intermediate scale recombination rates [50], as well as hotspots in their narrowest known sense [43] in the human genome. So far, however, these studies have reported little detail about the relationship between recombination hotspots and microsatellites. An ideal model organism in which to further examine the association is the yeast S. cerevisiae, since it is the simplest eukaryote, and recombination hotspots have been mapped throughout its entire genome [42]. Factors that could complicate an association between microsatellites and recombination are likely to be less problematic in yeast since, for example, the locations of genes and their expression levels have been well-characterized, making it possible to control for the links between microsatellites, recombination and transcription. Also, transposable or other known repetitive elements are not likely to mediate a link between recombination hotspots and microsatellites in yeast, since these elements are not enriched in yeast hotspots [42], as they are in human hotspots [43].

We investigated in detail the association between microsatellites and hotspots of meiotic double-strand breaks (DSBs), the precursors of meiotic recombination, throughout the S. cerevisiae genome [42]. As well as long microsatellite arrays, we considered low copy number repeats, which have not been studied previously in relation to recombination, including those with only two copies. This allowed us to address the question of whether recombination is involved in the origin of microsatellites, which has previously been considered to occur mainly by accumulation of random point mutations [51]. An association between low-copy microsatellites and hotspots would suggest the involvement of recombination as a mutational mechanism in microsatellite evolution, since replication slippage is expected to act with significant frequency only on arrays of at least six copies [52–54], and there is no available evidence to suggest that short microsatellites have the potential to stimulate recombination.

We found several types of microsatellite to be strongly associated with recombination hotspots in S. cerevisiae, with levels of enrichment greater than two-fold. The associations are, however, stronger for longer microsatellites, and weak or absent for repeats with fewer than six copies. Our findings suggest that the link between microsatellites and recombination deserves further experimental exploration.

Results

We used hotspot locations mapped by Gerton and co-workers throughout the S. cerevisiae genome using microarray analysis of meiotic DSB frequency [42]. This study identified 177 hotspots, which encompassed all previously known meiotic recombination hotspots in the species, and 40 coldspots. For the purposes of our analysis, we extended the hotspots and coldspots to include the intergenic regions (IGRs) adjacent to the open reading frames (ORFs) identified by Gerton and co-workers [42], since yeast hotspots are typically centred on IGRs, in which most DSBs occur [55]. The hotspots as we defined them have a mean length of 3466 bp. The principal statistical comparisons we made were between hot and non-hot, rather than hot and cold regions, since the cold regions are too few to provide a reliable enough picture of microsatellite density, and recombination frequencies are very low in all experimentally tested regions outside hotspots [41, 44].

In general, numbers of repeats are very much lower in ORFs than in IGRs (Table 1), despite the fact that ORFs cover 73.5% of the genome. This is not surprising, since array length change mutations in microsatellites other than tri- or hexanucleotide repeats would cause frame-shifts in ORFs, destroying gene function. Short (3–5 bp) mononucleotide runs have similar frequency in ORFs and IGRs, but this is likely to be due to coding sequence such as AAA (Lys), GTTTTA (Val Leu), GGG (Gly) or AGGGTT (Arg Val), because the vast majority of the short mononucleotide repeats genome-wide are only three bp long. When making comparisons between hot and non-hot regions, we accounted for the low microsatellite abundance in ORFs by comparing ORFs exclusively with other ORFs, and IGRs only with other IGRs. We found the abundance of short-motif, AT-rich repeats to be dramatically higher than other repeat types throughout the genome, so we divided microsatellites by motif length as well as by array length in order not to lose information about longer motifs. We also separated poly-A from poly-G. Nineteen physically independent categories of motif and array length were used in total (see Methods section).

Table 1 Total number of microsatellite repeats and percentage of regions with at least one repeat in the S. cerevisiae genome. The e value denotes the number of bases in any part of a repeat within which no more than one mismatch was allowed with respect to the consensus motif. A lower e value therefore results in the detection of more imperfect repeats.

High microsatellite frequencies in meiotic recombination hotspots

Microsatellite frequencies in meiotic recombination hotspots and non-hot regions of the S. cerevisiae genome can be found in Additional file 1, Tables S1 and S2. Several types of microsatellite differ significantly in frequency between hot and non-hot areas (alpha adjusted using Bonferroni's correction = 0.0026; Table 2). Repeat frequencies in the 40 coldspots are generally lower than in other non-hot regions, but these differences are not statistically significant (Additional file 1, Tables S1 and S2). The correlation between DSB intensity level, assayed for all yeast ORFs by Gerton and co-workers [42], and microsatellite frequency, is generally weak (Additional file 1, Tables S3 and S4), but several repeat types, especially long poly-A and dinucleotide microsatellites, are markedly more abundant in hotspots than non-hot regions (Figure 1, Table 2).

Figure 1

Frequencies of high-copy, short-motif repeats in yeast intergenic regions. Mean microsatellite frequencies in S. cerevisiae IGRs divided according to DSB intensity into 473 hot, 89 cold and 5431 other regions, which were all IGRs not categorized as either hot or cold. Poly-AT arrays comprised the majority of dinucleotide repeats and are highlighted in grey. Error bars are plus and minus one SEM.

Table 2 Microsatellite types with a significant difference in frequency either between hot and non-hot IGRs, or hot and non-hot ORFs, in the S. cerevisiae genome. Significance was inferred where p < 0.0026, with the level of alpha adjusted for 19 independent classes of repeat using Bonferroni's correction. The Mann-Whitney U Test or T Test was used, depending whether samples were normally distributed. The e value denotes the number of bases in any part of a repeat within which no more than one mismatch was allowed with respect to the consensus motif. A lower e value therefore results in the detection of more imperfect repeats.

Of the types of microsatellite we investigated, mononucleotide runs are by far the most common, and long arrays are highly over-represented in hotspots. Although poly-A (n ≥ 6) is less than 28% enriched in hot IGRs, and is more common in non-hot than hot ORFs, poly-A (n ≥ 14) is between two- and two-and-a-half-fold more common in hot IGRs, and poly-G (n ≥ 14) is nearly five-fold over-represented, though this figure may be misleading as numbers of poly-G arrays are very low (Table 1). We used a lower limit of 14 bp to define long mononucleotide arrays, since a 14 bp poly-A tract was previously found to influence the activity of the S. cerevisiae ARG4 meiotic recombination hotspot [11]. Short poly-G runs are somewhat enriched in hotspots, and short poly-A is under-represented, but these differences can partly be explained by elevated GC content in hotspots, which has been shown previously [42], since correlations between DSB intensity and short mononucleotide runs are up to 50% weaker for IGRs, and are almost completely absent for ORFs, when controlling for GC content using partial correlation analysis (Additional file 1, Tables S3 and S4). For long microsatellites other than poly-G, correlations with DSB intensity are generally increased when controlling for GC content (Additional file 1, Tables S3 and S4).
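The idea of controlling for GC content by partial correlation can be illustrated with a minimal sketch. The source does not state the formula or software settings used, so the code below assumes a first-order partial correlation computed on ranks, in keeping with the Spearman's Rho analyses described in the Methods section. The function names and the example values are hypothetical and are not data from this study.

```python
import math

def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_spearman(x, y, z):
    """Rank correlation between x and y after removing the linear effect of rank(z)."""
    rx, ry, rz = ranks(x), ranks(y), ranks(z)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical per-region values: DSB intensity, repeat frequency, GC content.
dsb = [1, 2, 3, 4, 5]
repeat_freq = [2, 1, 4, 3, 5]
gc = [1, 3, 2, 5, 4]
print(round(partial_spearman(dsb, repeat_freq, gc), 3))
```

If the partial coefficient is much smaller than the uncontrolled one, the covariate accounts for much of the association; if it is similar or larger, it does not.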

Dinucleotide repeats of six copies or more, and especially those with ten copies or more, are strongly associated with both hot IGRs and hot ORFs, with poly-AT the most abundant type of repeat involved (Figure 1, Table 2). Trinucleotide repeats of more than six copies are approximately twice as frequent in hot as in non-hot IGRs (p = 0.0027, Mann-Whitney U Test). This association is not quite significant when using the conservative Bonferroni correction for multiple hypotheses (alpha = 0.0026, see Methods section), but trinucleotide microsatellites are much scarcer than mono- or dinucleotide repeats in the yeast genome (Table 1), so statistical power to detect effects on their distribution is lower.

More marginal associations are present for some other repeat types. Long hexanucleotide microsatellites are many-fold more frequent in hot than non-hot ORFs (p < 0.0001, Mann-Whitney U Test; Table 2), but this should be considered in view of the very small numbers of hexanucleotide repeats throughout the genome (Table 1). Dinucleotide repeats with between three and five copies are also significantly over-represented in hot compared with non-hot IGRs, but levels of enrichment are much lower than for longer microsatellites (Table 2). Frequency of two-copy repeats is not significantly different in hot compared with non-hot regions, despite the great abundance of these repeats relative to longer microsatellites, and the consequent high statistical power. Tetra- and pentanucleotide microsatellites show no significant associations at all, but these repeat types are very rare throughout the yeast genome (Table 1).

Properties of hotspot-associated microsatellites

We examined repeat array length and purity (number of mismatches with respect to the consensus repeated motif) for microsatellites of at least six copies in hotspots and other regions of the yeast genome. In addition, we compared the frequencies of insertion, substitution and deletion mismatches, with respect to the consensus repeated motifs, between hotspot-associated microsatellites and those in other regions. We found that poly-A and poly-G arrays are significantly longer in hot IGRs, and mismatched dinucleotide repeats of at least six copies are significantly longer in hot ORFs, but we saw no other significant differences in repeat length (Additional file 1, Tables S7 and S8). Microsatellites in hot and non-hot regions do not differ significantly in purity, but dinucleotide repeats in non-hot regions do show an elevated proportion of deletion mismatches (p = 0.0006, Mann-Whitney U test).

We looked at the sequence motifs of all microsatellites with repeat period between three and six to see if any particular motifs were associated with hotspots. No obvious associations were seen, but we did note that poly-purine/poly-pyrimidine motifs with only one G or C are clearly over-represented among the most common motifs for low copy repeats in both hot and non-hot regions (Additional file 1, Tables S9–S12). This is likely to be related to the enrichment of poly-purine/poly-pyrimidine tracts (PPTs) in the genome as a whole [56], and, as we have reported previously, PPTs with internal tandem repeats comprise only a small proportion of total PPTs [57]. The GC-content of all repeats with at least six copies is strikingly low in IGRs throughout the genome, but there are no significant differences between hot and non-hot regions for microsatellite GC-content (Additional file 1, Tables S5 and S6).

Possible complicating factors

The influence of microsatellites on transcriptional frequency [4–10], and the mutagenic effect of transcription on microsatellites [58] suggested that factors relating to gene expression could affect microsatellite distribution. Theoretically, this could drive the association between microsatellites and recombination hotspots in yeast, since transcriptional frequency (vegetative cells [59]) correlates with DSB intensity (p < 0.0001). However, looking at the "hottest" regions for transcriptional frequency (in equivalent numbers to the numbers of recombination hot regions studied), we found that the number of these that overlap with recombination hotspots is lower than random expectation, and the correlations between DSB intensity and frequency of microsatellites change very little when controlling for transcriptional frequency in partial correlation analysis (Additional file 1, Tables S3 and S4). DSBs have been shown to be more frequent in IGRs with two promoters (divergent transcription of flanking genes) than those with one (parallel transcription of flanking genes) or none (convergent transcription of flanking genes) [42]. We found that densities of some types of microsatellite do differ between IGRs with different numbers of promoters (Table S13). Significant differences are not present for longer microsatellites, however, with the exception of dinucleotide repeats, which are more common in IGRs with no promoters, though not significantly so when testing hot IGRs only. The association between poly-A and hotspots is not due to factors relating to the poly-A adenylation signal present in 3' untranslated regions (UTRs), since the level of enrichment of poly-A in hot over non-hot IGRs does not differ by more than 5% between regions with zero, one and two promoters (two, one and zero 3' UTRs respectively).

Another factor that could complicate the association between hotspots and microsatellites is complex (tightly bunched or highly degenerate) repeats. Our initial analysis left open this possibility, since our repeat-finding algorithm does not allow multiple consecutive mismatches within single microsatellites. We therefore looked at numbers of repeats within five and ten bp of other repeats, and compared levels between hot and non-hot regions (Additional file 1, Tables S14 and S15). We found that numbers of microsatellites within complex repeats in IGRs are similar in hot and non-hot, or somewhat higher in non-hot, regions. Degenerate or complex repeats do not, therefore, affect the association between microsatellites and hot IGRs. In ORFs, complex repeats are generally somewhat more frequent in hot regions, however, and this is the case for one repeat type that showed significant over-abundance in hot ORFs, namely long dinucleotide repeats of at least six and at least ten copies. This raised the question of whether the association between this type of repeat and hot ORFs is due to the presence of highly mismatched repeats counted multiple times by our repeat finder. We therefore repeated the analysis with dinucleotide microsatellites in ORFs occurring within 5 bp of other dinucleotide microsatellites grouped together as single arrays. This did not change the results for repeats with at least 10 copies, which still showed a strong association with hot ORFs (p < 0.0001). It did, however, reduce the significance of the association between dinucleotide arrays of at least six copies and hotspots, raising the p value to 0.014, which is above our alpha level. In view of this result, we removed dinucleotide repeats with at least six copies from our list of repeat types associated with hot ORFs.
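The regrouping step described above, in which arrays lying within 5 bp of one another are treated as a single array, can be sketched as follows. This is an illustration only: the interval representation (half-open start and end coordinates), the function name and the interpretation of "within 5 bp" as the gap between the end of one array and the start of the next are assumptions, not details taken from the original analysis.

```python
def merge_nearby(arrays, max_gap=5):
    """Merge (start, end) intervals separated by at most max_gap bases.

    Coordinates are assumed to be half-open: start inclusive, end exclusive.
    """
    merged = []
    for start, end in sorted(arrays):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous array: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical dinucleotide arrays in one ORF.
arrays = [(0, 12), (15, 27), (40, 52)]
print(merge_nearby(arrays))  # the first two arrays are 3 bp apart and are merged
```

Counting merged arrays rather than individual hits prevents one highly mismatched repeat from contributing several times to the frequency for a region.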

Microsatellite frequencies in hotspot flanking regions

We reported previously that PPTs are enriched in hotspot flanking regions as far as two ORFs removed from hotspots [57]. We repeated the analysis for microsatellites, but found no consistent evidence for a similar regional enrichment (Additional file 1, Tables S16 and S17). This suggests that the association with recombination hotspots is less broad in scale for microsatellites than for PPTs. It is also possible, however, that the lower relative abundance of microsatellites could obscure a more general broad scale association than we were able to detect, since several repeat types have higher mean frequencies in hotspot flanking regions but are too sparse for statistical significance. Furthermore, since microsatellites are enriched in both hot IGRs and hot ORFs as defined by the DSB map of Gerton et al. [42], and recombination breakpoints mapped on the finest possible scale are concentrated almost entirely in IGRs in yeast [55], the relationship between microsatellites and recombination probably is distal to some degree.

Discussion

The level of enrichment of microsatellites in yeast recombination hotspots we have detailed here is considerably greater than has been seen for human hotspots [43, 45]. It is not clear why this should be the case, but it is notable that the association between microsatellites and recombination in mammals is quite marked when considering broad scales of several hundred thousand kilobases or more [49, 50]. In view of evidence that humans and chimpanzees do not share a large proportion of hotspot locations [60, 61], one explanation for the discrepancy could be that hotspots do not stay in one place long enough, in these species, to leave strong local imprints in the form of simple sequences generated by hotspot-associated factors, but that hotspot density is more constant on a larger scale. Lower lability of yeast hotspots in evolutionary time could therefore, in theory, have resulted in the stronger associations we have seen.

A better-characterized difference between the yeast and human genomes, which could also contribute to the difference between the two species in the level of association between hotspots and microsatellite abundance, is the vastly greater amount of non-coding DNA in humans. Yeast intergenic regions are small, averaging only just over 500 bp, and 75% of them contain promoters. Potentially, this could complicate the association between recombination hotspots and microsatellites due to the links between microsatellites, transcription, and recombination. Our findings suggest that this is not the case, however. It is also unlikely that transposable, or known repetitive, elements mediate the link between recombination hotspots and microsatellites in yeast, since they are not over-represented in the yeast hotspots we studied [42].

The two most obvious factors that could contribute to the association are a mutation bias, relating to recombination, or some other property of hotspot regions, causing microsatellites to form and grow, and regulation of hotspot locations by simple sequences. We attempted to isolate evidence for a mutagenic effect of recombination on microsatellites by investigating short arrays, as these are not likely to be significantly affected by replication slippage, and there is no available evidence to suggest that they have the potential to stimulate recombination. We did not find strong associations with hotspots for low-copy repeats, however, and previous evidence suggests that long microsatellites have the potential to stimulate recombination, as well as to be mutated by it. Some previous findings have cast doubt on the possibility that these phenomena have a widespread influence, and this has limited the amount of attention they have so far been given, but other evidence, including our results, suggests that they should be tested further.

Evidence that microsatellites could play a role in regulating recombination has been found at a chromosomal level in S. cerevisiae for poly-A [11], poly-AC [12, 14] and pentanucleotide [13] arrays, and using extra-chromosomal DNA molecules for several repeat types [62–66]. The existence of hotspots without local microsatellites does not rule out a functional role for the sequences in recombination, since it has been established that mechanisms of hotspot regulation are heterogeneous [46, 48, 67]. High frequencies of microsatellites in some regions outside hotspots are also not conclusive evidence against their functional involvement, since the control of hotspot location has been shown to be complex and multi-levelled, with local and distal sequences, transcription factor binding and chromatin structure alterations all implicated (reviewed in [46, 48, 67]). The ability of microsatellites to bind transcription factors [68], and to affect chromatin structure in vitro [69] and in vivo [70], therefore suggests two ways in which they could function to potentiate recombination at a subset of hotspots. This could happen without DSBs actually occurring in microsatellites; deletion of a 14 bp poly-A tract reduced activity of the yeast ARG4 hotspot by 75% despite the fact that DSBs avoid poly-A [71, 72].

It is also plausible that recombination is involved in some proportion of microsatellite mutations. The vast, presently unexplained, differences in mutation rates between loci (reviewed in [30, 73]) suggest the involvement of heterogeneous mutational mechanisms or regional mutation biases. In model organisms, evidence has been found both for [12, 33, 74] and against [34, 35] a role for recombination, in the mutation of different types of microsatellite. Studies have shown microsatellite mutation rates on the human Y chromosome to be similar to autosomal levels [36–38], but concluding from these that recombination does not play a role in microsatellite evolution is problematic, since the Y chromosome undergoes intramolecular recombination [75]. It is therefore possible that meiotic recombination, or other properties of its hotspots, could contribute to the variability in microsatellite mutation rates at different chromosomal locations. Although unequal crossing over, or meiotic gene conversion (recombination without exchange of flanking markers), are the most obvious mechanisms for this, other factors could be important, such as replication pausing, which has been linked to microsatellite mutations [76, 77], and may be causally involved in a subset of recombination hotspots [67].

Conclusion

We found that high-copy, short-motif microsatellites are strongly associated with S. cerevisiae meiotic recombination hotspots. The association is weak or absent for low-copy repeats. Our results add to the weight of evidence in favour of further studying the link between microsatellites and recombination hotspots. Large-scale experimental studies in yeast could be used to quantify the level of influence hotspots have on microsatellite evolution, and to explore the possible functional role of microsatellites in regulating recombination. This work could include tracking microsatellite mutations in monoclonal yeast populations from recombining and non-recombining strains. The effect on recombination frequency of deleting microsatellites from hotspots could also be tested.

Methods

Figures for transcriptional activity were from the study by Holstege and co-workers (1998) who mapped transcription frequency in vegetative cells for each yeast ORF [59]. For IGRs, we took the mean of the two adjacent ORFs.

Detection of microsatellites

We detected microsatellites in the yeast genome using an algorithm written in C [78]. The programme initially generated databases of all non-overlapping repeats of two copies or greater for repeated motif sizes two to six bp, and three copies or greater for mononucleotide arrays. Separate databases were created for perfect repeats, arrays with a maximum of one mismatch allowed per ten bp of repeat sequence, and arrays with a maximum of one mismatch per six bp. Microsatellites overlapping two regions were excluded from the analysis. This occurred for less than one percent of arrays overall.
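As an illustration of the perfect-repeat case only, the following minimal sketch scans a sequence for tandem repeats using the copy-number thresholds given above (three or more copies for mononucleotide runs, two or more for 2–6 bp motifs). It is not the C programme used in this study [78]: the function names are hypothetical, mismatches are not handled, and the choice to skip motifs that are themselves repeats of a shorter unit (for example AA or ATAT) is an assumption made so that each array is reported under its shortest motif.

```python
def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit."""
    n = len(motif)
    return not any(n % d == 0 and motif[:d] * (n // d) == motif
                   for d in range(1, n))

def find_perfect_repeats(seq, max_motif=6):
    """Return sorted (start, motif, copies) tuples for perfect tandem repeats.

    Within each motif length the reported arrays do not overlap.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif + 1):
        min_copies = 3 if k == 1 else 2
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if not is_primitive(motif):
                i += 1
                continue
            # Count consecutive exact copies of the motif starting at i.
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == motif:
                copies += 1
            if copies >= min_copies:
                hits.append((i, motif, copies))
                i += copies * k  # jump past the array to avoid overlaps
            else:
                i += 1
    return sorted(hits)

print(find_perfect_repeats("GGAAAATATATCC"))
# [(2, 'A', 4), (5, 'AT', 3)]
```

Allowing one mismatch per six or ten bp, as in the imperfect-repeat databases, would require an extension step that tolerates substitutions, insertions and deletions, which this sketch does not attempt.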

Categorization of microsatellites

Copy number groups were two, three to five, six or more and ten or more. For mononucleotide repeats, we used 14 or more instead of ten or more, since a 14 bp poly-A tract has been shown to be a functional component of a yeast recombination hotspot [11]. We divided mononucleotide microsatellites into the equivalent motif groups A/T and G/C. Dinucleotide microsatellites were considered as a whole for statistical comparisons, but we divided them into the motif groups AT/TA, AC/CA/TG/GT, AG/GA/TC/CT and CG/GC in order to see the relative abundance of each motif type within the class. We examined sequence motifs of microsatellites with three to six bp motifs visually. We investigated compound and highly degenerate microsatellites by looking at numbers of arrays within five or ten bp of another microsatellite of the same or larger copy number group.
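The grouping rules above can be expressed as a short sketch. The labels, function names and return types are hypothetical; the thresholds and motif groups are those stated in the text. Because the "six or more" class contains the "ten or more" (or "14 or more") class, an array can belong to more than one copy-number group.

```python
# Dinucleotide motif groups as listed in the text; keys are the motifs
# that belong to each group.
DINUCLEOTIDE_GROUPS = {
    "AT/TA": {"AT", "TA"},
    "AC/CA/TG/GT": {"AC", "CA", "TG", "GT"},
    "AG/GA/TC/CT": {"AG", "GA", "TC", "CT"},
    "CG/GC": {"CG", "GC"},
}

def copy_number_groups(motif_length, copies):
    """Return the copy-number groups an array falls into."""
    long_cutoff = 14 if motif_length == 1 else 10
    groups = []
    if copies == 2:
        groups.append("2")
    elif 3 <= copies <= 5:
        groups.append("3-5")
    if copies >= 6:
        groups.append(">=6")
    if copies >= long_cutoff:
        groups.append(">=%d" % long_cutoff)
    return groups

def motif_group(motif):
    """Return the motif group for a mono- or dinucleotide motif, else None."""
    motif = motif.upper()
    if len(motif) == 1:
        return "A/T" if motif in "AT" else "G/C"
    for name, members in DINUCLEOTIDE_GROUPS.items():
        if motif in members:
            return name
    return None

print(copy_number_groups(1, 14), motif_group("A"))   # poly-A of 14 bp
print(copy_number_groups(2, 4), motif_group("TG"))   # (TG)4
```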

Statistical analysis

Statistical comparison of means (Student's T-test and Mann-Whitney U Test, 2-tailed tests in all cases) and correlation analyses (Spearman's Rho) were done using SPSS or SAS. We initially tested the distribution of each sample for normality (Kolmogorov-Smirnov Test) and subjected significantly non-normal samples only to non-parametric tests. Because repeats were divided into 19 physically independent categories for statistical testing, Bonferroni's correction was used to set the alpha level at 0.05/19 = 0.0026. For the purpose of this calculation, the number of categories did not include different mismatch types, because, within motif and size classes, these overlap substantially so are not independent from each other. For the same reason, the size class six copies and longer was not considered to be independent of the class 10 copies and longer for the purpose of calculating the number of independent categories. Bonferroni's correction is clearly very conservative for this study, because we lose statistical power with increasing numbers of categories due to the fact that there are proportionally fewer microsatellites in each category. We would therefore gain a large amount of power by limiting the categorization to a 4-way division of microsatellites into short and long mononucleotide repeats, and short and long 2–6 bp motif repeats. This would not change the main conclusions of the paper, because, for all motif lengths, long microsatellites are either more frequent in hotspots or are extremely rare (Additional file 1, Tables S1 and S2). Some interesting information would be lost with this scheme, since poly-A and dinucleotide repeats are highly predominant among long microsatellites, and two-copy repeats are vastly more frequent than 3–5 copy repeats (Additional file 1, Tables S1 and S2), so we favoured the 19-way division.
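The Bonferroni threshold and its effect on the reported p values can be checked with a few lines of arithmetic. This is a sketch for illustration; the function names are hypothetical, and the p values are those quoted in the Results section.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance threshold under Bonferroni's correction."""
    return family_alpha / n_tests

def is_significant(p_value, alpha):
    """A test is significant when its p value falls below the threshold."""
    return p_value < alpha

alpha = bonferroni_alpha(0.05, 19)   # 19 independent repeat categories
print(round(alpha, 4))               # 0.0026

# p values quoted in the Results section.
reported = {
    "trinucleotide repeats, hot vs non-hot IGRs": 0.0027,
    "dinucleotide deletion mismatches": 0.0006,
    "dinucleotide arrays (n >= 6) in ORFs after regrouping": 0.014,
}
for label, p in reported.items():
    print(label, is_significant(p, alpha))
```

With the alternative 4-way division mentioned above, the same calculation would give a threshold of 0.05/4 = 0.0125, which shows how much statistical power the finer categorization costs.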