The planktonic microcrustacean Daphnia pulex is a key species in lake ecosystems and forms an important link between the primary producers and the carnivores. It is among the best-studied animals in ecological, toxicological, and evolutionary research [1-4]. With the availability of the v1.1 draft genome sequence assembly for D. pulex, it is now possible to analyse the genome in a comparative context.

Tandem repeats (TRs) are characteristic features of eukaryotic and prokaryotic genomes [5-13]. Traditionally, they are categorized according to their unit size into microsatellites (short tandem repeats, STRs; repeat unit size 1-6 bp, or 1-10 bp in some publications), minisatellites (repeat unit size 10 to approximately 100 bp), and longer satellite DNA (repeat units of >100 bp). Typically, STRs contribute between 0.5% and 3% to the total genome size.

TR loci in general, and micro- and minisatellite loci in particular, are often highly dynamic genomic regions with a high rate of length-altering mutations [14, 15]. Therefore, they are frequently used as informative molecular markers in population genetic, forensic, and molecular ecological studies [6, 16-22]. Due to their high abundance in genomes, microsatellites (STRs) are useful markers for genome mapping studies [23-26].

In contrast to the early view that TRs are mostly non-functional "junk DNA", the picture has emerged in recent years that a high proportion of TRs could have either functional or evolutionary significance [27-34]: TRs frequently occur within or in the proximity of genes, i.e., either in the untranslated regions (UTRs) up- and downstream of open reading frames, within introns, or in coding regions (CDS) [32]. Recent evidence indicates that TRs in introns, UTRs, and CDS regions can play a significant role in regulating gene expression and modulating gene function [32, 35, 36]. Highly variable TR loci were shown to be important for rapid phenotypic differentiation [37, 38]. They can act as "evolutionary tuning knobs" which allow fast genetic adaptations on ecological timescales (see [34] for a review, see also [39]). Furthermore, TRs can be of profound structural as well as evolutionary importance, since genomic regions with a high density of TRs, e.g., telomeric, centromeric, and heterochromatic regions, often have specific properties such as alternative DNA structure and packaging. The structure of DNA can, in turn, influence the level of gene expression in these genomic regions [28, 33, 34, 37, 40]. Altogether, the analysis of the TR content of genomes is important for an understanding of genome evolution and organisation as well as gene expression and function.

TR characteristics in different taxa and different genomic regions

With the rapid accumulation of whole genome sequence data in the last decade, several studies revealed that STR densities, usage of repeat types, length characteristics, and typical imperfection vary fundamentally between taxonomic groups [9, 11, 41-44] and even among closely related species [45-48]. In addition, strong differences of STR characteristics among different genomic regions have been described [9, 12, 43, 44, 49]. The often taxon-specific accumulated occurrence of certain repeat types in different genomic regions can hint at a functional importance of these elements. These characteristics are interesting from a comparative genomics as well as an evolutionary genomics point of view [9, 11, 12, 43, 44, 50, 51].

Related work

Several studies have compared the characteristics of microsatellites (1-6 bp or 1-10 bp) among different taxa and different genomic regions, e.g. [9, 44]. In these studies, however, the characteristics of TRs with a unit size >6 bp or >10 bp have been neglected. It has sometimes been argued that repeats with a unit size above 10 bp are generally rare in genomes, a presumption that has never been systematically tested. Furthermore, most studies are restricted to perfect TRs, mainly because this significantly simplifies their identification. The disadvantage of this approach is that imperfections are a natural, taxon-dependent feature of TRs and should therefore be included in an analysis rather than neglected. Even more importantly, TRs with long units tend to be more imperfect [10, 52], so a meaningful survey that includes repeats with a unit size above 10 bp has to include imperfect repeats.

Studies on characteristics of microsatellites can also be categorized according to whether they use the TR coverage of a sequence (in this paper referred to as the density, see Methods) or a number count of TRs per sequence length as the main characteristic of TRs. We recommend the use of a TR density (as in [9]) instead of number counts, since the latter do not represent the true TR content of a sequence. For example, the number count of a single perfect, 10000 bp long repeat, which might cover 20% of a sequence, is the same as that of a 20 bp repeat that covers only 0.04% of the same sequence. Depending on the number of mismatches, indels, or sequencing errors, as well as the allowed degree of imperfection, the same 10000 bp repeat can be counted as one satellite or as several. Hence, TR densities have the clear advantage that they depend much less on the allowed degree of imperfection of a satellite than number counts do.
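The arithmetic behind this example can be checked with a short sketch. The sequence length of 50000 bp is not stated in the text; it is the length implied by a 10000 bp repeat covering 20% of the sequence. The function names are illustrative only.

```python
def coverage_percent(repeat_lengths, sequence_length):
    """Percentage of the sequence that lies within the listed repeats."""
    return 100.0 * sum(repeat_lengths) / sequence_length


def count_per_mbp(repeat_lengths, sequence_length):
    """Number of repeats per megabase pair of sequence."""
    return len(repeat_lengths) * 1_000_000 / sequence_length


# Assumption: sequence length implied by "10000 bp covers 20%".
SEQ_LEN = 50_000

long_repeat = [10_000]  # one perfect 10000 bp repeat
short_repeat = [20]     # one 20 bp repeat

# Both cases have the same number count ...
print(count_per_mbp(long_repeat, SEQ_LEN), count_per_mbp(short_repeat, SEQ_LEN))
# ... but very different coverage (about 20% versus about 0.04%).
print(coverage_percent(long_repeat, SEQ_LEN), coverage_percent(short_repeat, SEQ_LEN))
```

The number count is identical in both cases, whereas the coverage differs by a factor of 500.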


The aim of this comparative genomic study is to analyse the density and length characteristic of perfect and imperfect TRs in the 197.3 Mbp nuclear genome of the newly sequenced model crustacean D. pulex and compare these to the characteristics of TRs in eleven other eukaryotic genomes from very different taxonomic groups ranging in size from 12.1 Mbp to 3080 Mbp (Table 1). For the annotated genomes of Daphnia pulex, Drosophila melanogaster, and Apis mellifera we also compare the repeat characteristics among different genomic regions (5'UTR, 3'UTR, CDS, introns, intergenic regions). In regions with a defined strandedness we also investigate whether the densities of repeat types differ from the densities of their reverse complements.

Table 1 List of species genomes analysed in the present study together with basic information on the genome assembly.


Genome sequence data

The twelve sequenced genomes analysed in the present study are listed in Table 1. This list also contains the size, the CG-content, the assembly version, and the download reference of each genome. The size refers to the number of base pairs in the haploid genome. It reflects the current state of the genome build and includes known nucleotides as well as unknown nucleotides (Ns). CG-content and genome size were determined with a self-written program. For D. melanogaster, the analysis of TRs in the complete genome includes the Het (heterochromatic), U and Uextra sequence files. Similarly, for A. mellifera, we included scaffolds in the file GroupUn_20060310.fa.

Gene locations and features

For the D. pulex genome we obtained the most recent 'frozen gene catalogue' of the v1.1 draft genome sequence assembly from January 29th 2008 in the generic GFF (General Feature Format) from Andrea Aerts (DOE Joint Genome Institute), which in similar form is available from This catalogue contains the predicted and to some extent still putative gene locations. For each gene model, it provides the predicted locations of exons, and for most genes also the locations of coding regions, start and stop codons. Since the catalogue often contains multiple or alternative gene models at the same locus as well as duplicate or overlapping features of the same type within the same gene model, a C++ program was written by CM to remove multiple gene models in order to avoid an overrepresentation of these loci in the analysis. To be more precise, if two predicted gene models overlapped and if both genes were found in the same reading direction, the longer of the two gene models was removed. Similarly, if two exons or two coding (CDS) features of the same gene overlapped, the longer of the two features was removed. Introns and intergenic regions were identified by the locations of exons that are associated to the same gene model. If available, the start and stop codon positions within exons of a gene were used to infer the locations of 5' and 3'UTR. This information on the positions of different genomic regions was finally used to split the genome sequences into six sequence files, each containing the sequence fragments associated to exons, introns, 5'UTRs, 3'UTRs, CDS, or intergenic regions. Since the TR characteristics of exons are just a combination of the TR characteristics of CDS and UTR regions, they have not been included in the present analysis.
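The removal of overlapping features can be illustrated with a minimal sketch. It is not the C++ program mentioned above, which is not shown in the text; it is one possible reading of the rule "if two features overlap on the same strand, the longer one is removed". The `Feature` type, the scaffold name, and the greedy shortest-first strategy are assumptions, and the original program may resolve chains of overlaps differently.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqid: str   # scaffold or chromosome name
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # inclusive
    strand: str  # '+' or '-'


def overlaps(a, b):
    """True if both features lie on the same sequence and strand and share a base."""
    return (a.seqid == b.seqid and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def drop_longer_overlapping(features):
    """Keep a feature only if it overlaps no shorter feature that was kept."""
    kept = []
    by_length = sorted(features, key=lambda f: (f.end - f.start + 1, f.seqid, f.start))
    for feature in by_length:
        if not any(overlaps(feature, k) for k in kept):
            kept.append(feature)
    return sorted(kept, key=lambda f: (f.seqid, f.start))


# Hypothetical gene models on a hypothetical scaffold.
models = [
    Feature("scaffold_1", 100, 900, "+"),   # overlaps the next model, longer
    Feature("scaffold_1", 500, 700, "+"),
    Feature("scaffold_1", 600, 1200, "-"),  # opposite strand, kept
]
filtered = drop_longer_overlapping(models)
print(filtered)
```

In this example the 801 bp model on the plus strand is dropped because it overlaps a shorter model in the same reading direction, while the model on the minus strand is retained.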

For A. mellifera we used the same procedure as for D. pulex. A GFF file with annotation information was obtained from Unfortunately, the annotated features have so far not been officially mapped on assembly version 4.0, so the TR analysis of genomic regions had to be performed with assembly version 2.0.

For the D. melanogaster genome, separate sequence files for the six different features of interest can readily be downloaded from Since these files also contain multiply or alternatively annotated features, again a C++ program written by CM was used to consistently remove the longer of two overlapping features if both were of the same feature type and annotated in the same reading direction. The separate sequence files for different genomic regions do not include the sequence fragments found in the Het (heterochromatic), U and Uextra sequence files of the current assembly, since these regions have not yet been annotated [53].

For the 5'UTRs, 3'UTRs, introns, and CDS regions of the three genomes we always extracted and analysed the sense strand of the corresponding gene. This makes it possible to identify differences in the repeat characteristics of the sense and anti-sense strands, i.e. to search for a so-called strandedness.
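Extracting the sense strand of an annotated feature amounts to taking the reverse complement of the assembly sequence for features annotated on the minus strand. The following is a minimal sketch assuming 1-based inclusive coordinates as in GFF; the function name and the example sequence are hypothetical.

```python
# Translation table for complementing nucleotides; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def sense_strand(scaffold, start, end, strand):
    """Return the feature sequence as read on the sense strand of the gene.

    start and end are 1-based and inclusive; strand is '+' or '-'.
    """
    fragment = scaffold[start - 1:end]
    if strand == "-":
        # Reverse complement for genes annotated on the minus strand.
        return fragment.translate(COMPLEMENT)[::-1]
    return fragment


scaffold = "GGAAGAAGCC"  # hypothetical assembly sequence
print(sense_strand(scaffold, 3, 8, "+"))  # AAGAAG
print(sense_strand(scaffold, 3, 8, "-"))  # CTTCTT
```

The same locus therefore appears as an AAG repeat for a gene on the plus strand and as a CTT repeat for a gene on the minus strand.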

Terms and Conventions

For a given TR unit, the associated repeat type is defined as follows: All TRs with units that differ from the given repeat unit only by circular permutations and/or the reverse complement are associated to the same repeat type. Clearly, there are always several repeat units, which belong to the same repeat type. We follow the convention to represent a repeat type by that unit which comes first in an alphabetical ordering of all units that are associated to it [54]. This convention allows us to count and identify repeat units without reference to the repeat unit phase or strand. To give an example, the repeat type represented by the unit AAG incorporates all TRs with units AAG, AGA, GAA, TTC, TCT, and CTT. Furthermore, the term repeat motif is used instead of the term repeat type when we aim at distinguishing between sense and anti-sense strand repeat characteristics, but not the repeat phase. Hence, on the level of repeat motifs, AAG, AGA, GAA are all represented by AAG, but are distinguished from the repeat motif CTT, which also represents TTC and TCT. Finally, the terms repeat type and repeat motif are distinguished from the term repeat class which we use to denote the collection of all repeats with the same repeat unit size (e.g. mono-, di-, trinucleotide repeats).
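These conventions can be expressed compactly in code. The sketch below assumes upper-case units over the alphabet A, C, G, T; the function names are illustrative and are not taken from any of the programs used in this study.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rotations(unit):
    """All circular permutations (phases) of a repeat unit."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def reverse_complement(unit):
    return unit.translate(COMPLEMENT)[::-1]


def repeat_motif(unit):
    """Normalise with respect to phase only (strand is preserved)."""
    return min(rotations(unit))


def repeat_type(unit):
    """Normalise with respect to phase and strand."""
    return min(rotations(unit) + rotations(reverse_complement(unit)))


def repeat_class(unit):
    """The repeat class is simply the unit size."""
    return len(unit)


for unit in ["AAG", "AGA", "GAA", "TTC", "TCT", "CTT"]:
    print(unit, repeat_motif(unit), repeat_type(unit), repeat_class(unit))
```

All six units are assigned to the repeat type AAG, while the motifs separate them into AAG (for AAG, AGA, GAA) and CTT (for CTT, TTC, TCT).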

An important property of one or a set of TR types is their density within a nucleotide sequence. It is defined as the fraction of base pairs that are found within repeats of a given set of repeat types over the total number of base pairs in the sequence. Repeat type densities are measured in base pairs per megabase pair (bp/Mbp). The density can be envisaged as the coverage of the sequence with the specified repeat types. Since in several genomes, including D. pulex, the number of unknown nucleotides (Ns) contributes significantly to the total size, all TR densities computed in this work were corrected for the number of Ns. It is important to distinguish repeat densities from densities based on number counts of repeats (measured in counts/Mbp) that are sometimes used in publications, e.g. [44, 47, 51].
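A minimal sketch of such a density computation is given below. It assumes that the correction for Ns consists of subtracting the number of Ns from the sequence length before normalising; how Ns inside repeats are treated is not specified here, and the numbers in the example are hypothetical.

```python
def tr_density_bp_per_mbp(repeat_bp, sequence_length, n_count=0):
    """Base pairs within repeats per megabase pair of known sequence.

    repeat_bp:       total number of base pairs found within the repeats
    sequence_length: total sequence length including Ns
    n_count:         number of unknown nucleotides (Ns) in the sequence
    """
    known_length = sequence_length - n_count
    if known_length <= 0:
        raise ValueError("sequence contains no known nucleotides")
    return repeat_bp * 1_000_000 / known_length


# Hypothetical example: 4000 bp of repeats in a 1 Mbp sequence with 200000 Ns.
print(tr_density_bp_per_mbp(4_000, 1_000_000))           # uncorrected: 4000.0
print(tr_density_bp_per_mbp(4_000, 1_000_000, 200_000))  # corrected:   5000.0
```

Ignoring the Ns in this example would underestimate the density by 20%.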

TR detection and analysis

The characteristics of perfect and imperfect TRs strongly depend on the properties individual satellites have to fulfil to be included in the analysis. For perfect TRs this is the minimum repeat length or its associated alignment score, which in TR search programs is often defined as a function of the unit size. Changing the minimum repeat length affects not only the total density of different TR types, but also their relative densities, since the length distributions of different repeat types usually differ strongly. For imperfect TRs it is additionally necessary to restrict or penalize their imperfection, e.g. with a mismatch and gap penalty. Furthermore, an optimality criterion has to be specified that determines which of two alternative alignments of a putative TR locus with its perfect counterparts is to be preferred.

In the present work, TRs were detected using Phobos, versions 3.2.6 and 3.3.0 [55]. Phobos is a highly accurate TR search tool that is able to identify perfect and imperfect TRs in a unit size range from 1 bp to >5000 bp without using a pre-specified motif library. The optimality criterion Phobos uses is the alignment score of the repeat region with a perfect repeat counterpart. This means that each putative TR is extended in both directions as far as possible, by including gaps and mismatches, if this leads to a higher alignment score (see the Phobos manual for details [55]). For the present analyses, the alignment scores for match, mismatch, gap, and N positions were 1, -5, -5, and 0, respectively. In every TR the first repeat unit was not scored. A maximum of four successive Ns was allowed. For a TR to be considered in the analysis it was required to have a minimum repeat alignment score of 12 if its unit size was less than or equal to 12 bp, or a score of at least the unit size for unit sizes above 12 bp. As a consequence, mono-, di-, and trinucleotide repeats were required to have a minimum length of 13, 14, and 15 bp, respectively, to achieve the minimum score. For repeat units above 12 bp a perfect repeat had to be at least two units long, and an imperfect repeat even longer, to achieve the minimum score. For this study, imperfect TRs were analysed in two size ranges: 1-50 bp and 1-4000 bp. For both size ranges a recursion depth of five was used. For the size range 1-50 bp the maximum score reduction was unlimited; for the size range 1-4000 bp the maximum score reduction was set to 30 to accelerate the computation while preserving good accuracy. For details concerning the search strategy of Phobos and its scoring scheme the reader is referred to the Phobos manual [55].
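The minimum lengths of perfect repeats follow directly from the scoring scheme: the first unit is not scored and every further matching base adds 1 to the score. The short sketch below reproduces this arithmetic; it is an illustration of the stated thresholds and not part of Phobos, and the function names are hypothetical.

```python
MATCH_SCORE = 1  # score per matching position, as used in this study


def minimum_score(unit_size):
    """Minimum alignment score required for a repeat of the given unit size."""
    return 12 if unit_size <= 12 else unit_size


def minimum_perfect_length(unit_size):
    """Shortest perfect repeat (in bp) that reaches the minimum score.

    The first repeat unit is not scored, so the required score has to be
    collected from the bases that follow it.
    """
    return unit_size + minimum_score(unit_size) // MATCH_SCORE


for unit_size in (1, 2, 3, 6, 12, 13, 50):
    print(unit_size, minimum_perfect_length(unit_size))
```

For unit sizes of 1, 2, and 3 bp this gives 13, 14, and 15 bp; for unit sizes above 12 bp it gives exactly two repeat units.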

Phobos has been used for this analysis since it is more accurate in the unit size range 1-50 bp than other TR search tools. Besides searching for imperfect repeats, Phobos is also able to identify whether alternative alignments exist for a TR. For example, the (ACACAT)N repeat can be viewed as an imperfect dinucleotide or a perfect hexanucleotide repeat. In this discipline, the Tandem Repeats Finder (TRF) [52] is the only alternative. While it is the state of the art in the detection of imperfect repeats with long unit sizes, it is based on a probabilistic search algorithm. In particular, it is less accurate when detecting TRs with a short unit size and a small number of copies. In contrast, Phobos uses an exact (non-probabilistic) search algorithm, which is necessary for a meaningful statistical analysis of TR characteristics. The search parameters used in this analysis are compared to the default search parameters of the TRF program in Additional file 1. TR characteristics such as the density and mean length of repeat types were computed using the program Sat-Stat, version 1.3.1, developed by CM.
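The (ACACAT)N example can be made concrete by scoring the same sequence against both candidate units. The sketch below is a strongly simplified, ungapped illustration that uses the match and mismatch scores of this study (1 and -5) and leaves the first unit unscored; it does not reproduce the Phobos search algorithm, which also handles gaps and trims repeats to their optimal extent.

```python
MATCH, MISMATCH = 1, -5  # scores used in this study


def ungapped_score(sequence, unit):
    """Score a sequence against a perfect repeat of `unit` without gaps.

    The sequence is assumed to start in phase with the unit, and the first
    repeat unit is not scored.
    """
    score = 0
    for i in range(len(unit), len(sequence)):
        expected = unit[i % len(unit)]
        score += MATCH if sequence[i] == expected else MISMATCH
    return score


locus = "ACACAT" * 4  # four copies of the hexanucleotide unit, 24 bp

print(ungapped_score(locus, "AC"))      # imperfect dinucleotide view: -2
print(ungapped_score(locus, "ACACAT"))  # perfect hexanucleotide view: 18
```

Under these assumptions the hexanucleotide interpretation scores far higher, because every third AC unit in the dinucleotide interpretation carries a mismatch whose penalty cancels the five matches around it.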

In principle, results can be compared to the available TR databases [56-60]. However, due to the differences in search parameters and problems related to probabilistic searches, such a comparison makes sense in only a few cases and has therefore not been performed in this study.


Characteristics of STRs in all 12 genomes

Genomic density

For a first comparison the genomic density of imperfect STRs has been plotted against the genome size of the twelve species analysed in this study (Figure 1a). The genome sizes as well as the genomic densities of STRs vary considerably among the 12 taxa. The three arthropods in this analysis, D. pulex, D. melanogaster, and A. mellifera, show only slight differences in genome size, but large differences in the density of STRs (Figure 1a, Table 2). Among the three arthropods, D. pulex has by far the lowest STR density, with a value of almost one third of that of A. mellifera. Compared to the other 11 genomes the STR density in D. pulex is about average. No significant correlation was found between the genome size and the density of STRs (Pearson correlation coefficient: R = 0.483, P = 0.111). See also Additional file 2, where the data of Figure 1 are presented for perfect and for truly imperfect TRs in two separate graphs. Most notably, D. pulex, but also A. mellifera, have much higher densities of perfect than imperfect STRs.
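The significance of a Pearson correlation coefficient can be checked with the standard t transformation. The sketch below assumes a two-sided test with n = 12 data points (one per genome); how the reported P value was computed is not stated in the text. The critical value 2.228 is the tabulated two-sided 5% quantile of Student's t distribution with 10 degrees of freedom.

```python
import math


def pearson_t(r, n):
    """t statistic for testing a Pearson correlation r from n data points."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Two-sided 5% critical value of Student's t with n - 2 = 10 degrees of freedom.
T_CRIT_DF10 = 2.228

t_density = pearson_t(0.483, 12)  # genome size versus STR density
print(round(t_density, 3), t_density > T_CRIT_DF10)
```

The resulting t statistic of about 1.74 lies below the critical value, in line with the non-significant P value reported above.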

Table 2 Main characteristics of STRs in the genome of Daphnia pulex and 11 other taxa.
Figure 1

a) Genome size (on logarithmic scale) versus genomic TR density and b) mean repeat lengths of perfect and imperfect short tandem repeats (1-6 bp) in Daphnia pulex and 11 other eukaryotic genomes. In the Additional file 2 we provide four related Figures where the information found in Figure 1 is shown separately for perfect and purely imperfect tandem repeats.

Mean length

A comparison of genome sizes and mean lengths of imperfect STRs of all 12 genomes is shown in Figure 1b. Even though the mean repeat length depends crucially on the search parameters for TRs, general trends can be seen in this comparison: STRs are shortest in D. pulex (average length 19.48 bp) and longest in M. musculus (average length 38.3 bp), see Figure 1b and Table 2. No significant correlation between genome size and mean length of STRs was found (Pearson correlation coefficient: R = 0.489, P = 0.107).

Whereas for the three vertebrate species a high TR density is correlated with a high value of the mean repeat length, no similar correlation can be observed for the three arthropods. While A. mellifera has an STR density of almost twice the value of D. melanogaster, the STRs are on average 20% longer in D. melanogaster than in A. mellifera. In Additional file 2, we present separate analyses of perfect and truly imperfect TRs. Most notable is that C. elegans, despite its low density of truly imperfect repeats, has on average very long imperfect TRs.

Genomic densities of mono- to hexanucleotide repeat classes

A more detailed comparison of the genomic densities of mono- to hexanucleotide repeat classes of all 12 taxa is presented in Figure 2. Whereas the upper panel shows the absolute repeat class densities, the lower panel shows their relative contribution to the STR density. Even better than from Figure 1a it becomes obvious that the absolute STR densities are highly variable even among taxonomically more closely related taxa such as the three arthropod species, the vertebrates, or the fungi species. Comparing the relative densities of STR classes, some taxon-specific trends are detectable (Figure 2, lower panel): C. elegans has a high relative density of hexanucleotide repeats, whereas pentanucleotide repeats are rare. All vertebrate species exhibit a particularly high proportion of tetranucleotide repeats while trinucleotide repeats are relatively rare. The two phytoplankton species have almost no mononucleotide repeats longer than 12 bp (minimum score 12, see Methods), whereas trinucleotide repeats are highly overrepresented. A high proportion of trinucleotide repeats is also found in the two fungi.

Figure 2

Absolute genomic densities (upper panel) and relative genomic densities (lower panel) of short tandem repeats (mono- to hexanucleotide repeats) in Daphnia pulex and 11 other genomes.

Comparing the relative densities of STR classes among the three arthropod species, we find that trinucleotide repeats are strongly overrepresented in D. pulex, contributing 30% to all STRs (Figure 2). The proportions of mono-, tetra-, penta-, and hexanucleotide repeats are almost identical in D. pulex and A. mellifera. With the exception of similar tetranucleotide densities there are no common features among D. pulex and the other two arthropod species.

Genomic densities of mono- to trinucleotide repeat types

Repeat type usage of mono-, di-, and trinucleotide repeats in the 12 genomes is very different (Table 3). Only the density of ACT repeats is consistently low in all species. Even among more closely related species, only few common features can be observed. Poly-A repeat densities are generally high except for T. pseudonana and O. lucimarinus, where they are even lower than poly-C repeats. In D. pulex, poly-C repeats have the highest genomic density compared to the other genomes. In vertebrates, AAT repeat densities are similarly high, CCG repeat densities are low, and ACG repeats are virtually absent. Among the three arthropods, only the relatively low densities of the ATC repeats are of similar magnitude. The repeat types AC, ACG, and CCG with low densities for most taxa have particularly high densities in O. lucimarinus. The AGG repeat type has high densities only in A. mellifera and M. musculus.

Table 3 Tandem repeat types of mono- to trinucleotide repeats for the genome of D. pulex and eleven other taxa.

Characteristics of TRs with unit sizes 1-50 bp in all 12 genomes

In contrast to most studies that only analysed STRs with a unit size of 1-6 bp, we compared the TR content of the 12 genomes in three unit size ranges: 1-6 bp, 1-10 bp, and 1-50 bp (Figure 3). The results show that in all 12 genomes the density of TRs with a unit size in the range 7-50 bp contributes significantly to the density of TRs in the unit size range 1-50 bp. The contribution ranges between 26.1% in M. musculus and 83.5% in C. elegans with a mean value of 42.8%. The contribution of 40.9% in D. pulex is slightly below average. In three genomes, i.e., D. melanogaster, C. elegans, and O. lucimarinus, the density of TRs with a unit size above 6 bp exceeds the density of STRs (Figure 3).
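The relation between the percentage contribution of the 7-50 bp repeats and the statement that they exceed the STR density can be made explicit: the 7-50 bp density is higher than the 1-6 bp density exactly when the contribution is above 50%. In the sketch below the function names are illustrative and the two input densities in the first example are hypothetical; the percentages in the second example are taken from the text.

```python
def long_unit_contribution(density_1_6, density_1_50):
    """Percentage of the 1-50 bp TR density contributed by unit sizes 7-50 bp."""
    return 100.0 * (density_1_50 - density_1_6) / density_1_50


def long_to_short_ratio(contribution_percent):
    """Density of 7-50 bp repeats relative to the density of STRs (1-6 bp)."""
    return contribution_percent / (100.0 - contribution_percent)


# Hypothetical densities in bp/Mbp that give a contribution of 40.9%.
print(long_unit_contribution(5_910.0, 10_000.0))

# Contributions reported in the text.
for species, percent in [("M. musculus", 26.1), ("D. pulex", 40.9), ("C. elegans", 83.5)]:
    print(species, round(long_to_short_ratio(percent), 2))
```

A contribution of 83.5% corresponds to a density of 7-50 bp repeats about five times the STR density, whereas 40.9% corresponds to about 0.7 times the STR density.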

Figure 3

Genomic density of tandem repeats in the three different unit size ranges 1-6 bp, 7-10 bp and 11-50 bp for Daphnia pulex and 11 other genomes.

Among the 12 genomes, strong differences are found for the density of TRs in the three unit size ranges and in individual repeat classes (Additional file 3). No systematic pattern can be observed for the arthropod, vertebrate, or fungi genomes. Compared to the other 11 genomes, the TR density in D. pulex is slightly below average in all three unit size ranges. Among the three arthropods, D. pulex has not only the lowest density of STRs as mentioned before, but also a density of TRs in the unit size range 1-50 bp which is about half the value found for D. melanogaster and A. mellifera (Figure 3, Table 4). For the three arthropod species in this study a more detailed analysis of the genomic density and length characteristics of TR classes in the range 1-50 bp is given in the following two sections.

Table 4 Repeat characteristics of TR classes with a unit size of 1 to 50 bp for Daphnia pulex, Drosophila melanogaster, and Apis mellifera.

Densities of the 1-50 bp repeat classes in the three arthropod species

Densities of the TR classes in the range 1-50 bp show strong differences among the three arthropod species (Figure 4, Table 4). In D. pulex, trinucleotide repeats represent the dominant repeat class followed by di- and mononucleotide repeats. Together, these three repeat classes contribute 47.16% to the total density of all repeat classes from 1-50 bp. Other repeat classes with a local maximum in the repeat class density are the 10, 12, 17, and 24 bp repeats (Table 4, Additional file 4). D. melanogaster, in contrast to the other two arthropods, shows a strong heterogeneity in repeat class densities. Genomic density is highest for TRs with a unit size of 11 bp followed by peaks at 5 and 12 bp (Table 4, Figure 4). Relatively high density peaks are also found for the repeat classes 21-24 bp, 30-36 bp, 39, 43, 45, and 46 bp. Especially for the longer repeat classes, there are usually only very few repeat types which contribute to the density of their repeat classes. For instance, the individual repeat types ACCAGTACGGG, ACCGAGTACGGG, and ACCAGTACGGGACCGAGTACGGG contribute 95.2% (5967.1 bp/Mbp), 76.4% (1736.4 bp/Mbp), and 71.0% (393.3 bp/Mbp) to the density of the (dominating) repeat classes 11 bp, 12 bp, and 23 bp, respectively. All three repeat types are highly similar, which shows that ACCAGTACGGG is the dominating repeat type in this genome. In A. mellifera, as in D. pulex, STR classes contribute most to the overall TR density. Mono- to tetranucleotide repeat densities are higher than in the two other arthropods. The highest density is contributed by the dinucleotide repeats, which have a genomic density more than three times as high as in the other two arthropod species. The small local density maxima at 10 and 12 bp are similar to D. pulex. TRs with longer repeat units have very low densities with a small local maximum only for 26 bp and 36 bp repeats.

Figure 4

Genomic density of tandem repeats with a unit size of 1-50 bp (dark columns) and their respective length characteristics (grey lines with boxes) for the three arthropod species investigated in this study.

Mean lengths of the 1-50 bp repeat classes in the three arthropod species

Similar to the repeat densities, strong differences between the mean lengths of TRs with respect to the unit size are observed for the three arthropod species (Figure 4, Table 4). Since the minimum length of TRs is twice the unit size, it is expected to see a trend toward longer repeats for an increasing unit size. Roughly, this trend can be confirmed for D. pulex and A. mellifera, whereas for D. melanogaster a trend can only be seen when not taking into account some of the repeat classes with extraordinarily long repeats. In D. pulex and A. mellifera, all mean repeat lengths are shorter than 254 bp in the unit size range 1-50 bp. D. pulex shows a notable peak for the mean repeat lengths of 17 bp repeats, a repeat class that is discussed in detail below. Among the smaller peaks in the mean repeat length spectrum of D. pulex there is a trend towards peaks that correspond to repeat classes that are multiples of three base pairs (Figure 4, Additional file 4).

In contrast, D. melanogaster has mean repeat length peaks above 500 bp for several repeat classes. This explains why the genomic density of TRs found in D. melanogaster is twice as high as in D. pulex even though the total number of TRs is lower (Table 4). A maximum mean repeat length of 2057 bp is found for the 46 bp repeat class, which consists of 12 repeats ranging in length from 355 bp to 11248 bp. It should be mentioned at this point that the high densities of longer repeat classes in D. melanogaster are concentrated in the heterochromatic regions of this genome. The sequencing and assembly of these regions was so difficult that it was done in a separate Heterochromatin Genome Project [61, 62]. See also the discussion below.

Characteristics of TRs with unit sizes 1-50 bp in different genomic regions

Patterns of TR densities and length characteristics were analysed in detail for the different genomic regions of D. pulex, its reference genome D. melanogaster, and A. mellifera (Figures 5, 6, 7, Additional file 5). The number of sequences in the genomic regions, their base content and length characteristics are given in Table 5. Both median and mean sizes of the different genomic regions are listed for a more comprehensive picture. The same information for the repeat sequences is given in Table 6. Comparing the TR densities among corresponding genomic regions in the unit size ranges 1-6 bp, 1-10 bp and 1-50 bp (Figure 5), the TR densities were generally highest in A. mellifera, lower in D. melanogaster, and lowest in D. pulex, with the only exception of a higher TR density in introns of D. pulex than in D. melanogaster. In all three genomes, the density contribution of the 7-50 bp repeat classes to all repeats in the size range 1-50 bp is much higher in CDS and intergenic regions than in introns and UTRs (see also Additional file 5). In CDS regions the contribution of 7-50 bp repeats is highest, with 72.8% in D. pulex, followed by 52.1% and 44.0% in D. melanogaster and A. mellifera, respectively. For all three species and in all size ranges, the densities are lowest in CDS regions. TR densities in D. pulex and A. mellifera are highest in introns in all unit size ranges, followed by intergenic regions, with a much higher difference in D. pulex. In D. melanogaster, STRs are most abundant in 3'UTRs closely followed by introns, 5'UTRs, and intergenic regions (Additional file 5). In the unit size range 1-50 bp, repeats are more dense in intergenic regions due to the high density of TRs with longer units in the vicinity of heterochromatic regions. It should be noted that a major proportion of heterochromatic regions is not included in the intergenic regions data set (see Methods for the origin of these files), since in these regions genes are not reliably annotated. However, since there are no clear boundaries between heterochromatic and euchromatic regions, some of the typical repeats found in heterochromatic regions are also found in the intergenic regions.

Table 5 Characteristics of the CDS, introns, and intergenic regions of D. pulex, D. melanogaster, and A. mellifera.
Table 6 Characteristics of the TRs found in the CDS regions, introns, and intergenic regions of D. pulex, D. melanogaster, and A. mellifera.
Figure 5

Tandem repeat densities in different genomic regions of Daphnia pulex, Apis mellifera, and the euchromatic genome of Drosophila melanogaster in the unit size ranges 1-6 bp, 7-10 bp, and 11-50 bp.

TR classes

Genomic densities of TR classes show high dissimilarities among the different genomic regions of D. pulex, D. melanogaster, and A. mellifera. In CDS regions of all three genomes, repeat densities are dominated by repeat classes with unit sizes that are multiples of 3 bp, consistent with the reading frame (Additional file 5, Figure 6), see also [63]. Notable exceptions are 10 and 20 bp repeat classes in D. pulex and 10 bp, 11 bp, and 16 bp repeat classes in A. mellifera, which have not only relatively high densities in CDS regions, but also relatively long repeat regions. The proportion of repeats (based on number counts) in the unit size range 1-50 bp not consistent with the reading frame is 11.4% in D. pulex, 3.1% in D. melanogaster, and 22.7% in A. mellifera.
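A repeat is consistent with the reading frame if its unit size is a multiple of 3 bp. The count-based proportion of repeats that are not consistent with the reading frame can be computed as in the following minimal sketch; the function name and the list of unit sizes are hypothetical and serve only to illustrate the calculation.

```python
def out_of_frame_percent(unit_sizes):
    """Percentage of repeats (by number count) whose unit size is not a multiple of 3."""
    if not unit_sizes:
        raise ValueError("no repeats given")
    out_of_frame = sum(1 for size in unit_sizes if size % 3 != 0)
    return 100.0 * out_of_frame / len(unit_sizes)


# Hypothetical unit sizes of eight repeats found in CDS regions.
cds_repeat_unit_sizes = [3, 3, 6, 9, 10, 12, 24, 20]
print(out_of_frame_percent(cds_repeat_unit_sizes))  # 25.0
```

In this hypothetical list only the 10 bp and 20 bp repeats are out of frame, giving a proportion of 25%.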

Figure 6

Genomic density of tandem repeats with a unit size of 1-50 bp in different genomic regions in Daphnia pulex, the euchromatic genome of Drosophila melanogaster, and Apis mellifera (columns) and their respective average lengths (grey lines, secondary y-axis).

Several repeat classes are more dense in CDS regions than in other regions, e.g. the densities of the 24 bp repeat class in D. pulex, the 39 bp repeat class of D. melanogaster, and the 6, 10, 15, 16, 18, 21, 30, 36 bp repeat classes of A. mellifera are significantly higher in CDS regions than in all other regions. In a separate analysis conducted only for D. pulex, we searched for TRs in the size range 1-4000 bp in CDS regions. The results show repeat densities above 100 bp/Mbp also for the 51, 52, 60, 75, 108, and the 276 bp repeat classes. A list of all TRs found in CDS regions of D. pulex is given in Additional file 6.

In introns of D. pulex and D. melanogaster the proportion of STRs is higher than in the other genomic regions, whereas in A. mellifera, with a general trend to shorter repeat units, this cannot be observed. In D. pulex, the repeat classes with a unit size of 1-5 bp and 7-8 bp show by far the highest densities in introns as compared to other genomic regions (Additional file 5). Most dominant are trinucleotide repeats, which are more dense in introns of D. pulex than in introns of D. melanogaster and A. mellifera. A notable feature in introns of D. melanogaster is the relatively high density of the 31 bp repeat class. The intergenic regions of D. pulex and D. melanogaster show high densities for several longer repeat classes which are rare or absent in other regions (Figure 6, Additional file 5). In D. pulex, for example, the 17 bp repeat class shows a high repeat density only in intergenic regions, whereas in the other two arthropods it is relatively rare in all genomic regions. Repeat classes with a particularly high density in intergenic regions can be found in Additional file 5. Concerning the UTRs in D. pulex, the TR statistics have to be treated with caution for repeat classes longer than 3 bp, since only a small proportion of genes have well-annotated UTRs, so that the total numbers of TRs found in 5' and 3'UTRs (135 and 653, respectively) are low. For example, the inflated density of the 24 bp repeat class in 5'UTRs of D. pulex is based on just a single 272 bp long repeat. As a general result, TRs with short units dominate in UTRs.

Mean lengths of the TR classes in the different genomic regions are more heterogeneous in D. melanogaster than in D. pulex and A. mellifera. This is not just the case for intergenic regions including the heterochromatin, but also in introns (e.g. the 31 bp repeat class) and CDS regions (e.g. 39 bp and 48 bp repeat classes), see Figure 6.

TR motifs and strandedness

For genomic regions with annotated sense and anti-sense strands, we analysed whether the characteristics of TRs with certain repeat units differ on the two strands. In order to investigate this question we (i) always analysed the sense strand of annotated gene features and (ii) reported the repeat unit in a form normalized only with respect to the repeat phase (cyclic permutations), here called the repeat motif, instead of the repeat type, which is normalized with respect to phase and strand (cyclic permutations and the reverse complement, see Methods for details). Results that include the information on repeat motif strandedness are presented in Figure 7 and in Additional file 7.
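
The distinction between repeat motif and repeat type can be made concrete with a small sketch. It assumes that the alphabetically smallest variant is chosen as the representative, which is consistent with the names used in this text (e.g. AGC versus CTG) but is not confirmed here as the authors' implementation; all function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(unit):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return unit.translate(_COMPLEMENT)[::-1]

def rotations(unit):
    """All cyclic permutations (phases) of a repeat unit."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]

def repeat_motif(unit):
    """Normalize for phase only (assumed: alphabetically smallest rotation)."""
    return min(rotations(unit))

def repeat_type(unit):
    """Normalize for phase and strand (rotations of the unit and of its reverse complement)."""
    return min(rotations(unit) + rotations(reverse_complement(unit)))

# A unit read as TGC on the sense strand keeps its strand information as a motif,
# but is merged with its reverse complement as a type.
print(repeat_motif("TGC"))  # CTG
print(repeat_type("TGC"))   # AGC
```

With this convention, the motifs AGC and CTG remain distinct on the sense strand, whereas both belong to the repeat type AGC.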

Figure 7

Genomic density of trinucleotide repeat motif pairs (normal and reverse complement) in different genomic regions of Daphnia pulex, Drosophila melanogaster, and Apis mellifera. Whereas in intergenic regions both types are always of similar density, in introns and CDS regions there are often strong differences in densities supporting a strand-specific repeat motif usage (strandedness). Lines with boxes show the respective mean repeat length (secondary y-axis).

For D. pulex, D. melanogaster, and A. mellifera, repeat motif usage shows only a few common features among the genomes and different genomic regions. Common features of all three genomes are a relatively high density of poly-A/T repeats in introns and intergenic regions, low densities of CG repeats in all regions, and higher densities of AAC and AGC repeats in CDS regions than in introns and intergenic regions. Repeat motifs that are more dense in introns than in CDS and intergenic regions of all three genomes are poly-T, AT and GT (Additional file 7). Several repeat motifs show a strong strandedness in the CDS regions of all three genomes. Most notable are the repeat motifs AAC and AAG, which have much higher densities than their reverse complements GTT and CTT. A smaller but still noticeable trend is observed for AAT versus ATT repeats. Strandedness also occurs in introns of D. pulex, where poly-T repeats have much higher densities than poly-A repeats. Other motif pairs with considerably different densities on the sense strand in introns are ATT versus AAT, CT versus AG, GT versus AC, and ATTT versus AAAT. In all these examples T-rich motifs are preferred on the sense strand.

Restricting the search for common features to D. pulex and D. melanogaster one finds that CCG/CGG repeats are predominantly found in CDS regions, whereas AT repeats show their highest densities in 3'UTRs (data not available for A. mellifera), see Additional file 7. The absolute densities of the AT repeat type in 3'UTRs, however, differ significantly with values of 220.5 and 2663.6 bp/Mbp in D. pulex and D. melanogaster, respectively. In both genomes, the dominant repeat motif in CDS regions is AGC, with a particularly high density of 1658.9 bp/Mbp in CDS regions of D. melanogaster.

Curiously, for both genomes (D. pulex and D. melanogaster), the repeat motif AGC shows much higher densities on the sense strand of CDS regions than its reverse complement, the repeat motif CTG (340.7 bp/Mbp versus 74.7 bp/Mbp and 1658.9 bp/Mbp versus 26.9 bp/Mbp, see Additional file 7). In introns of D. pulex, a strandedness for this motif is not present, whereas in introns of D. melanogaster it is much less pronounced. In contrast to D. pulex and D. melanogaster, the repeat motif AGC has only a moderate density in all regions of A. mellifera. Conversely, the dominant repeat motif in CDS regions of A. mellifera, ATG, is very rare in the other two genomes. Also this repeat motif shows a considerable strandedness in CDS regions of A. mellifera. Other repeat motifs with a high density in CDS regions of A. mellifera, but with low densities in the other genomes are ACT and AGT. Also notable is the high density of the dinucleotide (and thus reading frame incompatible) repeat motif CT (435.8 bp/Mbp) in CDS regions of A. mellifera and the strong discrepancy to the low density of its reverse complement AG (20.3 bp/Mbp). As mentioned before, short units are dominant in introns of all three genomes. Dominant repeat motifs in introns of D. pulex are poly-T followed by CT and CTT. Among tetranucleotide repeats, the motifs CTTT and ATTT show the highest densities. All these motifs have higher densities than their reverse complements. In introns of D. melanogaster, dominant repeat motifs are poly-A followed by poly-T and AT, with only a small strandedness of poly-A versus poly-T repeats. Densities in introns of A. mellifera are high for several repeat motifs. Most notable are the motifs AT followed by poly-A, poly-T, CT, AG, and AAT. The density of AT repeats in introns of A. mellifera (4069.0 bp/Mbp) constitutes the highest repeat motif density among the three genomes and their genomic regions. 
A notable strandedness is observed for the poly-A versus poly-T and for AAT versus ATT repeat motifs. In CDS regions of A. mellifera a high strandedness is also found for the AAGCAG motif (1480 bp/Mbp) versus CTGCTT (0.00 bp/Mbp). In introns, the two motifs still have the respective densities of 46.3 bp/Mbp versus 0.00 bp/Mbp.
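
One simple way to summarize strandedness is the ratio of a motif's density to that of its reverse complement. The sketch below uses the densities quoted above; the ratio as a measure and the function name are assumptions made for illustration, not a statistic defined in this study.

```python
import math

def strand_ratio(sense_density, revcomp_density):
    """Ratio of a motif's density to that of its reverse complement (both in bp/Mbp)."""
    if revcomp_density == 0:
        # The reverse complement is absent: the ratio is unbounded (or undefined if both are zero).
        return math.inf if sense_density > 0 else math.nan
    return sense_density / revcomp_density

# Densities in bp/Mbp as quoted in the text above.
pairs = {
    "D. pulex CDS, AGC vs CTG": (340.7, 74.7),
    "D. melanogaster CDS, AGC vs CTG": (1658.9, 26.9),
    "A. mellifera CDS, CT vs AG": (435.8, 20.3),
    "A. mellifera CDS, AAGCAG vs CTGCTT": (1480.0, 0.0),
}
for label, (sense, revcomp) in pairs.items():
    print(f"{label}: {strand_ratio(sense, revcomp):.1f}")
```

A ratio close to 1 would indicate no strand preference; a density of zero for the reverse complement, as for CTGCTT, is reported as infinite rather than raising an error.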

Concerning the mean perfection of TR motifs in different genomic regions (see table in Additional file 7, page 10 for details) we could not find many general trends. In different genomic regions of D. pulex, the mean perfection in the size range 1-50 bp was 98.36% in CDS regions, 99.09% in intergenic regions, and 99.31% in introns (the mean values are not shown in the above-mentioned table). For A. mellifera we found on average lower repeat perfections of 97.35% in CDS regions, 98.57% in intergenic regions, and 98.52% in introns. For D. melanogaster, mean repeat perfections are 97.35% in CDS regions, 98.55% in intergenic regions and 98.68% in introns. So in all three genomes, the mean repeat perfection is lowest in CDS regions. Differences in repeat perfection between introns and intergenic regions are small.

Strong differences among the three genomes are found for several repeat motifs: poly-C and poly-G densities are particularly low in A. mellifera, AT repeat densities are 20 and 30 times higher in intergenic regions and introns of A. mellifera as compared to D. pulex, and AnG (n = 1 to 5) and ACG densities are much higher in D. pulex and A. mellifera than in D. melanogaster. For instance, AAG repeat densities are about 40 times higher in introns and intergenic regions of D. pulex than in the same regions of D. melanogaster. Potentially interesting are TRs in CDS regions where the unit size is not directly compatible with the reading frame. As mentioned above, 10-mer repeats (and multiples of 10) have significant densities in CDS regions of D. pulex. Most notable are the repeat types AACCTTGGCG (Dappu-343799, Dappu-344050, Dappu-343482, Dappu-279322, Dappu-280555), ACGCCAGAGC (Dappu-264024, Dappu-264706, Dappu-275708), and ACGCCAGTGC (Dappu-267284, Dappu-267285, Dappu-275706, Dappu-275708, Dappu-277192). These three repeat types are completely absent in D. melanogaster and A. mellifera. Repeat motif usage in UTRs was only compared if the number of satellites in these regions was sufficiently high. All TR characteristics including the number counts are listed in Additional file 7. As a general result, repeat type usage is very heterogeneous on a genomic level as well as among different genomic regions. Within a given TR class there are usually only a few TR motifs which contribute to the density of the repeat class (Figure 7, Additional file 7).

Mean lengths of mono- to trinucleotide repeat types in different genomic regions of D. pulex show a relatively homogeneous length distribution, in contrast to the heterogeneous densities (Figure 7, Additional file 5). Peaks in average repeat length in the UTRs (see Additional files 5 and 7) must be regarded with caution due to small sample sizes (see above). In D. melanogaster and A. mellifera, TRs are generally longer than in D. pulex.

TRs with a unit size of 17 bp in D. pulex

The repeat class in D. pulex with the highest repeat density and a unit size longer than three base pairs is the 17 bp repeat class (Table 4). There are several notable aspects of these repeats: first of all, the true genomic density of 17 nucleotide repeats is likely to be underestimated in the current assembly since several scaffolds start or end with a 17-nucleotide repeat. For instance, the longest imperfect repeat found in D. pulex with a total length of 3259 bp is a 17 nucleotide repeat located at the end of scaffold 66. Three very similar repeat types (AAAAGTTCAACTTTATG with 273.0 bp/Mbp, mean length 318.5 bp; AAAAGTAGAACTTTTCT with 209.8 bp/Mbp, mean length 739.62 bp; AAAAGTTCTACTTTGAC with 88.9 bp/Mbp, mean length 705.3 bp) contribute 88% to the total repeat density of 17 bp repeats. (Further repeat types were found that are similar to these three.) A striking characteristic of these repeat types is their high similarity to their reverse complement. The two repeat types with the highest density have only 5 non-matching positions when aligned to their reverse complement. This might hint at a functional role or structural importance of these repeats (see Discussion). The mean length of all imperfect 17 nucleotide repeats is 270 bp, which is the highest value for repeats with a unit shorter than 46 bp in D. pulex. Repeats of the 17 bp repeat class are mostly found in intergenic regions with a density of 1039.4 bp/Mbp and mean length of 295.0 bp.
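
The similarity of a repeat unit to its reverse complement can be checked with a few lines of code. The sketch below assumes an ungapped comparison over all cyclic phases of the reverse complement; the alignment procedure actually used for the figure of 5 non-matching positions is not described in this section, so the function names and the approach are illustrative only.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(unit):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return unit.translate(_COMPLEMENT)[::-1]

def hamming(a, b):
    """Number of non-matching positions between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))

def min_mismatches_to_revcomp(unit):
    """Smallest mismatch count between a unit and any phase of its reverse complement."""
    rc = reverse_complement(unit)
    return min(hamming(unit, rc[i:] + rc[:i]) for i in range(len(unit)))

# The three 17 bp repeat types named in the text above.
for unit in ("AAAAGTTCAACTTTATG", "AAAAGTAGAACTTTTCT", "AAAAGTTCTACTTTGAC"):
    print(unit, reverse_complement(unit), min_mismatches_to_revcomp(unit))
```

A value of 0 would correspond to a unit that is identical to some phase of its own reverse complement, as is the case for short palindromic units such as ACGT.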

TRs with unit sizes above 50 bp in D. pulex

The results of the search for imperfect TRs in D. pulex with a motif size of 1-4000 bp are shown in Figure 8, in which the size range 1-50 bp has been omitted since it is shown in Figure 4 and Additional file 4. The density spectrum shows an irregular pattern of density hotspots in certain size ranges. The TR with the longest unit size (1121 bp) has a total length of 2589 bp, which corresponds to 2.31 repeat units. TRs with a unit size of 171 bp are very abundant. They have the same size as the well-known alpha-satellites. Alpha-satellites are a family of long TRs near the centromeres in vertebrate chromosomes and have frequently been reported [64]. Homology searches (Dotplots, BLAST) could not identify any similarity between the D. pulex satellites and the known alpha satellites of M. musculus and H. sapiens. Among the 10 non-mammalian genomes only D. pulex has a particularly high density of satellites in the unit size range 165-175 bp.

Figure 8

Genomic densities of tandem repeat classes in the unit size range 50 - 4000 bp in the genome of D. pulex. The TR with the longest unit found in this genome has a unit size of 1121 bp. An accumulation of repeat densities is observed for specific repeat unit sizes, e.g. around 160 bp and 190 bp.


Tandem repeats, together with interspersed repeats, are key features of eukaryotic genomes and important for the understanding of genome evolution. For the newly sequenced crustacean D. pulex we have analysed the characteristics of TRs and compared them to the TR characteristics of 11 other genomes from very different evolutionary lineages. A particular focus was on comparing the genomes of A. mellifera and the model insect D. melanogaster because of their shared ancestry with Daphnia within the Pancrustacea, and despite their large evolutionary divergence, they best served to help annotate the D. pulex genome.

A general problem of TR analyses is that the detection criteria, the allowed degree of imperfection, the optimality criterion as well as the accuracy of the search algorithm can significantly influence the characteristics of TRs found in a search [65, 66]. Therefore, a direct comparison of TR characteristics of different genomes is only possible if analyses were carried out by the same search tool using the same search parameters. Despite differences in the detection criteria, the TR type densities for Homo sapiens obtained in this study and by Subramanian et al. [12] agree well in terms of absolute and relative densities (see Table 3 in this paper and Figures 3, 4 and 5 in [12]), supporting the view that general trends can be largely independent of the search criteria. While Subramanian et al. [12] also used TR densities as the main characteristics, many studies rely on number counts. This type of data is difficult to compare to analyses using TR densities. Hence, in this paper we have compared our results mainly with those in Tóth et al. [9], since their detection criteria (perfect STRs, minimum length 13 bp), main characteristics (TR densities) and the compared taxa still come closest to those used in the present analysis. All comparisons drawn here have been confirmed (in a separate analysis) to hold true also when using the same search parameters as in [9].
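
To illustrate how strongly detection criteria shape the result, the following minimal sketch searches for perfect STRs with a minimum length of 13 bp, i.e. the criteria attributed to [9] above. It is not the search tool used in this study or in [9]; the regular-expression approach, the unit size limit of 1-6 bp, and the function name are assumptions made for illustration.

```python
import re

# A unit of 1-6 bp (lazy, so the shortest unit is preferred) followed by at least one more copy.
_PERFECT_STR = re.compile(r"([ACGT]{1,6}?)\1+")

def find_perfect_strs(sequence, min_length=13):
    """Return (start, unit, length) for perfect STRs of at least min_length bp."""
    hits = []
    for match in _PERFECT_STR.finditer(sequence.upper()):
        if len(match.group(0)) >= min_length:
            hits.append((match.start(), match.group(1), len(match.group(0))))
    return hits

print(find_perfect_strs("GGG" + "AC" * 7 + "TTT"))  # [(3, 'AC', 14)]
```

This sketch reports only non-overlapping matches made of complete unit copies, so a trailing partial unit is not counted; lowering `min_length` or allowing mismatches would change which repeats are reported, which is the point made in the paragraph above.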

Comparisons of TRs in the 12 genomes

Our analyses show that TRs contribute considerably to all genomes analysed in this study, which is consistent with earlier results ([5, 9, 11, 12, 51, 67] and many others). No TR characteristics were found that are common to all of the 12 genomes, except for a relatively low density of ACT repeats, which has already been reported in Tóth et al. [9]. The dominance of taxon rather than group specific characteristics has also been reported in [44, 51] when comparing number counts of satellites. As a general trend, Tóth and collaborators [9] also observed an underrepresentation of ACG repeats in most taxa. Our data support this trend with the striking exception of O. lucimarinus, where ACG repeats constitute the highest individual trinucleotide repeat type density in this study (Table 3). Curiously, the high absolute and relative di- and trinucleotide repeat densities found in O. lucimarinus are exclusively based on the high densities of the CG, ACG, and CCG repeat types that are uncommon in all other taxa in this study (see discussion below). The high CG-content of these three dominant repeat types is consistent with the high CG-content (60%) of the genome of O. lucimarinus.

Even within evolutionary lineages, common features of TR characteristics are rare. Notable are the clear dominance of poly-A over poly-C repeat types in all genomes except for the diatom and the green alga, the almost complete absence of mononucleotide repeats in the diatom and the green alga, and the almost complete absence of ACG repeats in vertebrates (Figure 2 and Table 3). Our data also support the result of Tóth et al. [9] that the relatively high proportion of tetranucleotide over trinucleotide repeat densities in vertebrates could not be found in any other taxonomic group. To establish these features as lineage specific, still more taxa need to be analysed. Besides these few cases of group specific similarities, this study reveals a high level of dissimilarity in genomic repeat class and repeat type densities among all taxonomic groups. Among the fungi, for example, the genomes of N. crassa and S. cerevisiae show no lineage specific similarities. In contrast to Tóth et al. [9], where AT and AAT repeats were the dominant di- and trinucleotide repeat types in genomes of fungi, N. crassa has a more than 2.6 times higher density of AC than AT repeats and a more than 3 times higher density of AAC than AAT repeats in this study. The three arthropod species, D. pulex, D. melanogaster, and A. mellifera, also show no remarkable similarities among mono- to hexanucleotide repeat class (Figure 2) or mono- to trinucleotide repeat type densities (Additional file 7). Several common features of arthropods that have been found in [9] cannot be confirmed in the present analysis: whereas these authors found dinucleotide TRs to constitute the dominant repeat class in arthropods, this cannot be confirmed in the present study for D. pulex where the density of trinucleotide repeats exceeds the density of dinucleotide repeats by 40%.
Furthermore, in [9] AC was the dominant dinucleotide and AAC and AGC the dominant trinucleotide repeat types in arthropods, which is not the case for the genomes of A. mellifera and D. pulex. Most striking, the AC, AAC, and AGC repeat type densities are particularly low in A. mellifera, a genome for which an untypical repeat type usage, as compared to other arthropods, has already been mentioned in [68]. A. mellifera also stands out as the taxon with the highest density of mononucleotide repeats in this study, whereas in [9] this repeat class was found to be densest in primates. In contrast to [9], where penta- and hexanucleotide repeats were "invariably more frequent than tetranucleotide repeats in all non-vertebrate taxa", this cannot be confirmed in the present study.

Going beyond the scope of previous TR analyses ([9, 11, 43, 44] and others), we compared characteristics of TRs with unit sizes in the range 1-50 bp. Our results reveal that imperfect TRs with unit sizes larger than 6 bp contribute significantly to the TR content of all genomes analysed. The model nematode C. elegans, e.g., was commonly thought to have a very low density of genomic TRs [9], which is true for the unit size range 1-5 bp, but not for the size range 6-50 bp (Additional file 2, see also Figure 3). This finding leads to a completely new picture for the TR content of this organism.

Concerning the mean lengths of STRs, this study showed that the genome of D. pulex is characterized by shorter STRs than the other genomes. Furthermore, among the STRs, perfect repeats have a higher density than imperfect repeats. Neglecting the still unknown contribution of unequal crossing-over to length altering mutations of STRs, their equilibrium lengths are the result of slippage events extending STRs and point mutations breaking perfect TRs into shorter repeats [41, 46, 69, 70]. The dominance of relatively short STRs in the genome of D. pulex indicates that the 'life cycle' of a typical TR is comparatively short, i.e. the frequency of interrupting point mutations is relatively high compared to extending slippage mutations. Furthermore, it has been discussed in the literature whether the typical length of TRs is inversely correlated to the effective population size (see e.g. [19]). Since large population sizes are a feature of D. pulex, our results are not in conflict with this conjecture.

Another interesting point is the typical perfection of TRs. Perfect TRs are believed to be subject to more length altering mutations than imperfect repeats, since a higher similarity of sequence segments increases the chance of slippage and homologous crossing-over events. Since the STRs found in D. pulex but also those in A. mellifera are predominantly perfect, we expect an increased number of length altering mutations in these two genomes. The mutability of STRs in D. pulex has been studied in detail by another group of the Daphnia Genomics Consortium, which compares the rate and spectrum of microsatellite mutations in D. pulex and C. elegans [71]. In view of this remark it is interesting that TRs in the size range 1-50 bp are on average more imperfect in CDS regions of all three arthropod genomes as compared to introns and intergenic regions.

A direct comparison of TRs with a unit size of 1-50 bp among the three arthropods shows remarkable differences. The dominant repeat classes (highest to lower densities) are the 2, 1, 3, 4, 5, and 10 bp repeat classes of A. mellifera, the 3, 2, 1, 17, 4, and 10 bp repeat classes in D. pulex and the 11, 5, 12, 2, 1, and 3 bp repeat classes in D. melanogaster. This highlights the trend towards shorter motifs in A. mellifera in contrast to the trend towards longer motifs in D. melanogaster. The relative dominance of 3 bp repeats in D. pulex likely reflects the great number of genes (>30000; Daphnia Genomics Consortium unpublished data) in this comparatively small genome. This same paper also states that D. pulex is one of the organisms most tightly packed with genes. Similar to the repeat densities, the mean lengths of TRs show remarkable differences among the three arthropods. An elevated mean length of TRs in a repeat class can hint at telomeric and centromeric repeats. In D. pulex, candidates for telomeric and centromeric repeats are found in the 17, 24, and 10 bp repeat classes. Since the long 17 bp repeats are usually located at the beginning or end of scaffolds, their true density is likely to be underestimated. Interestingly, just three very similar repeat types contribute 87% of the density to this repeat class. It is worth noting that the two repeat types with the highest density have only 5 non-matching positions when aligned to their reverse complement, which could lead to the formation of alternative secondary structures, see e.g. [33, 72].

As mentioned above, the CG, ACG and CCG repeat types are rare in all taxa except for O. lucimarinus, where the densities of these repeats are particularly high. Usually, the low densities of these motifs are explained by the high mutability of methylated CpG dinucleotides (as well as CpNpG trinucleotides in plants, where N can be any nucleotide), which efficiently disrupts CpG rich domains on short timescales. Since CCG repeat densities are also low in several organisms that do not methylate (C. elegans, Drosophila and yeast), Tóth et al. [9] argue in favour of other mechanisms, which lead to low CCG repeat densities, particularly in introns. According to our data, CpG and CpNpG mutations must certainly be suppressed in TR regions of O. lucimarinus. Furthermore, mechanisms which act against CpG-rich repeats in other species are not in effect in this genome. The particularly high densities of CG, ACG, and CCG as compared to all other mono- to trinucleotide repeat types in O. lucimarinus even raise the question whether CpG-rich repeats are simply favoured for unknown reasons, or whether they are prone to particularly high growth rates if their occurrence is not suppressed.

Interesting in this respect is a direct comparison of the densities of the ACG and AGC repeat types, which have identical nucleotide content on the same strand, but which differ in the occurrence of the CpG dinucleotide. The density ratio of AGC to ACG repeats ranges from high values in the vertebrates with a value of 63.4 in H. sapiens to 0.0040 in O. lucimarinus (Table 3). Even among the three arthropod species, this density ratio differs considerably: D. pulex (3.3), A. mellifera (0.28), and D. melanogaster (18.5). Interestingly, A. mellifera and O. lucimarinus are the only two species for which the density of ACG repeats is higher than the density of AGC repeats. Among the three arthropods, A. mellifera has the highest content of CpG containing TRs despite its lowest value for the genomic CG-content (34.9%) in this study. Consistent with this observation, a CpG content higher than in other arthropods and higher than expected from mononucleotide frequencies has been found previously, even though A. mellifera methylates CpG dinucleotides [73].
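
The difference between the ACG and AGC repeat types with respect to CpG can be verified mechanically. The following minimal sketch tests whether a tandemly repeated unit contains a CpG dinucleotide, including one that spans the boundary between two adjacent units; the function name is an illustrative assumption.

```python
def contains_cpg(unit):
    """True if the tandemly repeated unit contains a CpG dinucleotide."""
    # Concatenating two copies also catches a CpG that spans the unit boundary.
    return "CG" in (unit + unit).upper()

for unit in ("ACG", "AGC", "CCG", "CG", "GAC"):
    print(unit, contains_cpg(unit))
```

ACG and AGC have the same base composition, but only the tandem repeat of ACG contains CpG; a unit written in a different phase, such as GAC, is still recognized because the boundary between copies is included.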

In D. pulex, AnX (n = 1 to 10) repeat types are significantly overrepresented, a feature that has also been observed for other, distantly related species (H. sapiens [12], A. thaliana [44]). Lawson and Zhang [44] have argued that these repeats could have evolved from mutations in poly-A repeats.

TRs in genomic regions and their potential function

Several recent studies have shown that TRs are not just "junk DNA" but play an important role in genome organization, gene regulation and altering gene function. They have gained particular interest due to their potential for rapid adaptations and several authors regard them as hotspots for evolutionary success of species [28, 34, 36-39].

In D. pulex, STRs are predominantly found in introns with a clear preference for a small number of repeat types (AC, AG, AAG, AGC). Interestingly, all mono- to trinucleotide repeat types are densest in introns, with the exception of AT and CCG repeat types. A predominance of STRs in introns has not been reported for many genomes before, except e.g. for fungi in [9]. In D. melanogaster, STRs have highest densities in 3'UTR with a preference for AG, AT, AAC, and AGC repeats. Common to the D. pulex and D. melanogaster genome is the dominance of AC repeats in introns, AT repeats in 3'UTR, and CCG repeats in coding regions. Relatively high densities of CCG repeats in CDS regions and low densities in introns had also been reported for vertebrates and arthropods [9]. All these features are in contradiction to a model of neutral evolution of different TR types, see also [9, 34]. They suggest differential selection to prevail in different genomes and genomic regions, which in turn hints at an evolutionary or functional importance of TRs.

Concerning the density of different repeat classes in different genomic regions of D. pulex, the following observations are of particular interest: (i) The densities of the repeat classes 1-5, 7-8 bp are higher in introns than in CDS and intergenic regions. (ii) The densities of TRs with a unit size above 8 bp are much lower in introns than in the other regions. (iii) The densities of almost all repeat classes with a unit size longer than 10 bp that are a multiple of three are higher in CDS regions than in introns and even intergenic regions. (iv) The high density of trinucleotide repeats in introns raises the question how well introns have been annotated. Furthermore it would be interesting to determine DNA transfer rates between CDS regions and introns caused by mutations. This process could also be the reason for higher trinucleotide densities in introns. Observation (i) could be explained by a preference for TRs in introns that are more variable or that have higher repeat copy numbers, which both could be important for regulatory elements. Observation (ii) could indicate that TRs with longer motifs are not beneficial in introns. Alternatively, the restricted size of introns could be the limiting factor for TRs with longer motifs. Observation (iii), however, shows that the size of genomic features does not provide a good indication for the expected motif sizes of TRs. While introns and CDS regions have about the same size in D. pulex (see Table 5), observations (i) to (iii) show opposite preferences for the motif size of TRs in these two regions. The tendency toward longer repeat motifs in coding regions is presumably caused by tandemly repeated amino acid sequences, in particular for the motif PPR (proline - proline - glycine) and suggests strong protein domain level selection. Most interestingly, the absolute density of TRs with a unit size of 7-50 bp in CDS regions of D. pulex is higher than in CDS regions of D. melanogaster, despite the strong tendency towards longer repeat units in all other regions of D. melanogaster.

An interesting observation of our analysis is the strandedness found for some repeat motifs in CDS regions and introns. The fact that some motifs are favoured on a particular strand hints at a selective advantage that remains to be studied in more detail.

The overall strong differences in TR characteristics in genomes and genomic regions raise many questions. For the extreme outlier with respect to repeat type usage, O. lucimarinus, we found that the most dominant repeats have a high CG content, which correlates with the high CG content of the complete genome. It would certainly be interesting to study this putative correlation in a separate study. An observation of Riley et al. [33, 72] should be noted at this point. They have found that for repeats with putative regulatory function, the existence of the repeat and its overall structure is more important than the detailed base composition. This would allow organisms to have different repeat motifs with their preferred base composition at regulatorily important segments of the genome.

Finding annotation problems with TRs

The question arises whether TRs can be used to detect problems or inconsistencies in the current annotation of genomes. For this reason we had a closer look at selected TRs occurring in coding regions of D. pulex (from Additional file 6). Only a small proportion of these annotated genes show clearly low support, but the support decreased for annotated genes that host multiple TRs, e.g. Dappu-243907 and Dappu-318831. Furthermore, we had a look at gene models that host TRs with a motif size that is not a multiple of three, e.g. the relatively dense 10 and 20 bp repeat classes. Among these gene models, several were found for which the TR has almost the same size as the CDS element. Interesting examples with almost identical repeat units are found in the following annotated genes (parentheses contain the length of the CDS element, the length of the TR, and the repeat unit): Dappu-264024 (1075 bp, 1033 bp, ACGCCAGAGC), Dappu-264706 (165 bp, 113 bp, ACGCCAGAGC), Dappu-267284 (414 bp, 395 bp, ACGCCAGTGC), Dappu-267285 (460 bp, 459 bp, ACGCCAGTGC), and Dappu-265168 (738 bp, 473 bp, AATGCACGCCAGTGCACGCC). The numbers show that these CDS elements consist almost exclusively of the repeat pattern. The unit ACGCCA is indeed found in several other TRs in CDS regions of D. pulex. We found that the mean perfection of these 10-mer repeats (97.4%) is only marginally lower than that of 9-mer repeats (98.8%) or that of trinucleotide repeats (99.1%), indicating that their imperfection should not be taken as an indication of a potential invariability of these 10-mer repeats in CDS regions. Another problematic finding is the high repeat content in exons of D. melanogaster of two very similar repeat types with the units AAACCAACTGAGGGAACGAGTGCCAAGCCTACAACTTTG (195.4 bp/Mbp) and AAACCAACTGAGGGAACTACGGCGAAGCCTACAACTTTG (109.1 bp/Mbp), with no contribution of these repeat types to either CDS regions or UTRs, hinting at a problem in the annotation where these repeats occur.
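
The share of each CDS element that is covered by the TR follows directly from the lengths listed above. The sketch below only repeats this arithmetic; the function name is an illustrative assumption.

```python
# (gene model, CDS length in bp, TR length in bp) as listed in the text above.
examples = [
    ("Dappu-264024", 1075, 1033),
    ("Dappu-264706", 165, 113),
    ("Dappu-267284", 414, 395),
    ("Dappu-267285", 460, 459),
    ("Dappu-265168", 738, 473),
]

def tr_coverage(cds_length, tr_length):
    """Fraction of the CDS element that is covered by the tandem repeat."""
    return tr_length / cds_length

for gene, cds_length, tr_length in examples:
    print(f"{gene}: {tr_coverage(cds_length, tr_length):.1%}")
```

A check of this kind could serve as a simple filter for gene models that deserve a closer look, for instance by flagging CDS elements above a chosen coverage threshold; any such threshold would be a free parameter and is not specified in this study.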

Error margins

For the characteristics of TRs analysed in the present work we have not given any error margins, not because we believe that our results are exact, but because an estimate of error margins is hardly feasible. While a minor source of uncertainty might be introduced by the TR search algorithm, the main source of error is the incomplete nature of most genome assemblies (see Table 1). The genomic sequences of the current assembly of D. pulex, A. mellifera, D. melanogaster, and H. sapiens for instance contain 19.6%, 15.6%, 3.8%, and 7.2% unknown nucleotides (Ns), respectively (Table 1). But even the apparently low number of Ns in the latter two organisms might be too optimistic, which is phrased in [62] as follows: "... a telomere-to-telomere DNA sequence is not yet available for complex metazoans, including humans. The missing genomic "dark matter" is the heterochromatin, which is generally defined as repeat-rich regions concentrated in the centric and telomeric regions of chromosomes. Centric heterochromatin makes up at least 20% of human and 30% of fly genomes, respectively; thus, even for well-studied organisms such as D. melanogaster, fundamental questions about gene number and global genome structure remain unanswered."

For obvious reasons, most genome projects focus on sequencing easily accessible coding regions and leave aside highly repetitive regions which are difficult to sequence and assemble. As a consequence, TR densities will be lower in sequenced than in unsequenced genomic regions, and error margins for TR densities cannot be assessed statistically, but depend on mostly unknown systematic errors of the current assembly. The implication for the present work is that TR densities are likely to be underestimated for all genomes analysed. Among the three arthropods, D. melanogaster is the best-studied organism and the only one with a dedicated Heterochromatin Genome Project [61, 62]. For D. pulex and A. mellifera, heterochromatic regions have not yet been sequenced with the same effort. However, the contribution of heterochromatin in A. mellifera is estimated to be about 3% [73, 74], whereas in D. melanogaster the contribution is about 30%, without clear boundaries between euchromatin and heterochromatin [75]. These differences in sequencing status and in the sizes of heterochromatic regions could lead to a bias of yet unknown direction.
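The reasoning why densities measured on the sequenced part of a genome tend to underestimate the genome-wide value can be made explicit with a length-weighted average. The sketch below is purely illustrative: the function name and all numerical values are hypothetical, and the density in unsequenced regions is precisely the quantity that is not known in practice.

```python
def genome_wide_density(sequenced_density, unsequenced_density,
                        unsequenced_fraction):
    """Length-weighted mean TR density (e.g. in bp/Mbp) of a genome that
    consists of a sequenced part and an unsequenced part.

    `unsequenced_fraction` is the fraction of the genome (0..1) that is
    missing from the assembly.
    """
    if not 0.0 <= unsequenced_fraction <= 1.0:
        raise ValueError("unsequenced_fraction must lie between 0 and 1")
    return ((1.0 - unsequenced_fraction) * sequenced_density
            + unsequenced_fraction * unsequenced_density)


if __name__ == "__main__":
    # Hypothetical values: 1000 bp/Mbp measured in the sequenced part,
    # an assumed 5000 bp/Mbp in the unsequenced 20% of the genome.
    measured = 1000.0
    overall = genome_wide_density(measured, 5000.0, 0.2)
    print(f"measured: {measured} bp/Mbp, genome-wide: {overall} bp/Mbp")
```

Whenever the unsequenced regions are more repeat-rich than the sequenced ones, the genome-wide value exceeds the measured one, and the gap grows with the unsequenced fraction. Because both the unsequenced fraction and its density differ between genomes, the effect on comparisons between genomes cannot be predicted from this relation alone.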

Altogether, it is expected that this bias will not affect the generally robust trends we found in our analyses for the following reasons: in D. melanogaster, the trend towards longer repeat units appeared already in the first assemblies, while this has not been observed in A. mellifera. In this context it is interesting to note that the total density of STRs is still higher in A. mellifera than in D. melanogaster. In D. pulex, no reliable estimate of the contribution of heterochromatin is known. Our study indicates a trend to slightly higher contributions than in A. mellifera, but considerably lower contributions than in D. melanogaster.


The newly sequenced genome of Daphnia pulex shows several interesting characteristics of TRs which distinguish it from the other model arthropods D. melanogaster and A. mellifera. The density of TRs is much lower than in the two other arthropods. The mean length of STRs was the shortest among all genomes in this study. From a functional perspective it is interesting that STRs are by far densest in introns and that the contribution of TRs with units longer than 6 bp in CDS regions of D. pulex is even higher than in D. melanogaster. The finding of a strong strand bias in repeat motif usage (strandedness) underpins the functional relevance of several repeats. A notable feature of D. pulex is the high density of 17 bp repeats presumably associated with heterochromatic regions.

Comparing the 12 genomes, our results reveal an astonishing level of differences in TR characteristics among different genomes and different genomic regions, which even exceeds the level of differences found in previous studies. Extreme "outliers" concerning densities and repeat type usage (O. lucimarinus) even lead us to the conjecture that nature has not imposed general limitations concerning repeat type usage and densities of TRs in genomes. In view of several general and lineage-specific TR characteristics that have been refuted in this analysis and in view of the still small number of taxa that have been compared, the existence of common TR characteristics in major lineages becomes doubtful.

Altogether, this study demonstrates the need to analyse not only short TRs but also TRs with longer units, which contribute significantly to all genomes analysed in this study. Restricting an analysis to STRs leaves a large number of genomic TRs unnoticed that may play an important evolutionary (functional or structural) role.