Background

Short-sequence DNA repeats (SSR) are a type of satellite DNA in which a DNA pattern occurs two or more times. These patterns can be homogeneous or heterogeneous, can occur inter- and intragenically, and can be found in both eukaryotes and prokaryotes. While many SSR loci are in noncoding regions of DNA, we focus on those that occur intergenically and thus encode protein repeats as well. These longer intergenic SSRs are of interest as they can be used in genotyping, phylogenetic characterization and analysis of pathogenicity [1]. These loci can be identified for many species in genes encoding diverse functions: for example, SSRs are found in Haemophilus influenzae yadA, which encodes an adhesin [2], Staphylococcus aureus coagulase, coa, which is involved in clotting [3], and Streptococcus pneumoniae pspA, a surface protein implicated in pathogenic mechanisms [4, 5].

One species where SSRs have proven useful for genotyping is Anaplasma marginale, a bacterial tick-borne pathogen of cattle. Several factors make the development of a vaccine for this pathogen challenging, including the need for a deeper understanding of strain variation and distribution. A. marginale Major Surface Protein 1a (Msp1a), a protein with a set of SSRs near the amino terminus, has been extensively used for genotyping strains and cataloguing strain distribution. The msp1α repeats are 84–87 bp (28–29 amino acids) in length and vary in sequence and number [6]. To date, over 235 repeat sequences have been identified in the published literature. Cataloguing and tracking these repeats is difficult as diverse groups of researchers are involved in the identification of the repeats. Additionally, the task is error-prone when done manually. For example, there exist cases when the same name has been given to repeats with different sequences (specific instances are included in the Validation section of this manuscript), and conversely, the same repeat has been assigned more than one name (details in the Data Compilation section).

A reliable software tool capable of tracking, managing, analysing and cataloguing SSRs and genotypes can alleviate the aforementioned problems, fuel research and collaboration, and accelerate the path to the discovery of a vaccine. While there exist a variety of tools related to repetitive DNA in the literature, some with a subset of these functionalities, we found that few have sufficient data management capabilities to be used effectively in genotyping and none provide the analysis or visualization functionalities needed to gain insight into geographic genotype distributions. The tools Tandem Repeats Finder [7], scan_for_matches [8] and the ALGGEN software suite [9] can identify user-specified repetitive elements in DNA or protein sequences. However, these tools do not store the search sequences of interest and so they are not suitable to use with a large collection of named repeats. In addition, analysis and visualization functionalities fall outside of the scope of their intended uses. BLAST [10] is a related tool to these with a much more general purpose; it differs from our context, however, in that it enables one to search for a given element in a whole body of genes, rather than a known body of elements in a given gene. Repbase [11] is a database that stores repetitive DNA elements for a number of species for use in other tools. The tools that use the Repbase data, such as CENSOR [12] and RepeatMasker [13], are largely focused around removing repetitive elements rather than tracking or analysing them. Additionally, Repbase is limited to data from eukaryotes, which in turn limits the tools built around it. PSSRdb [14] is a database that stores repetitive gene data for many species of prokaryotes, including A. marginale and S. pneumoniae, but focuses on shorter repeating elements (1–6 bp) called simple sequence repeats. There is another class of related tools, with functionality not directly related to genotyping, that identify general repeating elements in DNA or protein sequences in a variety of ways. Examples of tools that fall into this category include [7, 1524]. While there are no tools specifically related to genotypic and geographic repeat analysis, we note that RepeatFinder [21] has some interesting analysis capabilities involving substructures of the repeating elements it identifies. Finally, we note that repeat finding in a broader sense is an extensively studied topic in bioinformatics. Sharma et al [25] provide a survey of tools used for mining microsatellites in eukaryotic genomes. Merkel and Gemmel [26] give a review of software tools for detecting short tandem repeats and a practical guide (aimed at biologists) on how to use the tools in an informed manner. Lim et al [27] provide a more recent review of tandem repeat search tools with a focus on algorithmic performance.

We developed RepeatAnalyzer, a new software tool designed to track, manage, analyse and catalogue SSRs and genotypes. The idea for RepeatAnalyzer originated from a very simple program for repeat identification developed by Carter Hoffman [28], and branched out to add many new features and analysis capabilities. Using a summary of repeat and strain data collected from the literature, RepeatAnalyzer currently maintains a list of all known A. marginale repeats and strains, the locations where these have been reported, and the original sources of those reports. When given the DNA or amino acid sequence of a gene/protein of interest, RepeatAnalyzer can determine which repeats the sequence contains and whether the genotype has been previously reported. RepeatAnalyzer also provides a variety of analysis capabilities, including: genetic diversity analysis for a specified geographical region; genotype analysis for a specified SSR, strain or location; and visualization of search results on geographic maps. While we developed RepeatAnalyzer using A. marginale as the model organism, the tool can also be used in the study of other bacteria with SSR loci.

Implementation

RepeatAnalyzer is a command line software tool, written in Python, for storing, managing, identifying and analysing a catalogue of SSRs for a gene locus of interest. These functionalities can be used in genotyping, phylogenetic characterization or pathogenic analysis. Figure 1 provides an overview of the central functionalities of RepeatAnalyzer.

Fig. 1
figure 1

A flowchart of RepeatAnalyzer’s functionalities. Each main branch (a-e) represents a type of functionality. Following each branch shows the program flow for each option

RepeatAnalyzer functionalities

  1. A)

    Input Newly Discovered SSRs and Genotypes. RepeatAnalyzer can add new genotypes and SSRs to its dataset from a plain text file specified by a user. The file is required to be in a simple but specific format (details are provided in the user manual). If data in the file already exist in the system, RepeatAnalyzer avoids adding the data again. Further, the program carries out a variety of checks to ensure consistency of data added to the dataset. If an occurrence of a genotype is found at a new location, that location is recorded. If a genotype or SSR with a known sequence is given a new name, that name is added to a list of current names for that genotype or repeat. If a repeat name that is in use is listed with a new sequence, the user is prompted to overwrite the old sequence, keep it in lieu of the new one or give it a new (unused) name. This is done to maintain the uniqueness of repeat names, which is necessary as the repeats present in a genotype are referenced by name only. The data input functionality of RepeatAnalyzer is shown in branch A of Fig. 1.

  2. B)

    Strain Identification. Given the sequence (either in DNA or amino acid form) of a known SSR locus for an organism from a species of interest, RepeatAnalyzer finds the maximal set of non-overlapping known repeats in that sequence and the name of the genotype, if any is stored. It also gives the locations where the genotype has been found previously and the sources of this information. The SSRs used for this functionality must have been input as described in section A. As mentioned in the Data Compilation section, the input file for A. marginale will be maintained along with the software, and files for other species may be kept similarly, if there is sufficient interest. However for any other species, the input file will need to be compiled by the user, as specified in the user manual.

The strain identification functionality is represented in branch B of Fig. 1. Upon selecting the appropriate menu item, the user can enter either the DNA or amino acid sequence for the gene of interest. Once the user has selected the appropriate option for the data entered, the program converts the input to amino acid form (if it is provided in DNA form), and then runs an implementation of the Knuth-Morris-Pratt (KMP) string searching algorithm [28] to search for all known repeats in the string (a schematic description of the KMP algorithm is provided in Fig. 2). Once this is completed, the program removes repeats that are substrings of other found repeats from the results before they are output. In the search for repeats, we chose the KMP algorithm because it is fast in practice (linear time) and has guaranteed worst-case behaviour compared to theoretically faster algorithms that do not offer the same reliability such as Boyer-Moore [29]. Suffix trees [30], which facilitate more efficient search procedures, are not suitable in our context because the search body is input at runtime (the time to build these structures, which is on the same order as running KMP itself, would counteract their search efficiency while consuming additional memory).

Fig. 2
figure 2

A flowchart of the Knuth-Morris-Pratt string searching algorithm. KMP is a computationally efficient algorithm for finding short text patterns in a long string of characters. offsets is a list of numbers. The length of offsets is the number of characters in pattern. Each entry in offsets corresponds to the distance counters i and j are adjusted when a mismatch occurs. len(item) denotes the length of item

  1. C)

    Analysis. RepeatAnalyzer provides two basic analysis functionalities for the repeat data it stores, which we have called diversity analysis and genotype analysis.

    1. i)

      Diversity analysis involves calculation of genetic diversity scores for a geographic region. A region can be defined as broadly as a country or as narrowly as a particular province or county. The genetic diversity is calculated in several different ways that we group into two categories. The first category of metrics (quantity-centric) measures the percentage of unique repeats in a region, while the second (distribution-centric) measures the regularity with which the repeats are distributed. The metric GD2, which was introduced in [31], is a variant of the quantity-centric category. It is simply the number of unique SSRs present in a region divided by the number of strains identified. We introduce a slightly modified variant of GD2, named GD2b, that can be calculated with only genotype data, as this is what RepeatAnalyzer stores. Specifically, GD2b is defined as the number of unique SSRs divided by the number of genotypes. In addition, we developed a new quantity-centric measure of diversity (called GDM1), and a new distribution-centric diversity measure (called GDM2). GDM1, like GD2 and GD2b, measures the amount of unique repeats in a region, but unlike GD2 and GD2b, is unaffected by the length of genotypes in the region. (Details on this are included in the Discussion). GDM2 measures how uniformly the repeat occurrences in a region are distributed. GDM1 and GDM2 each come in two variants, local and global, depending on whether the metric is calculated as an average of the values for each genotype or over the entire region, respectively. The mathematical definitions of these measures are summarized in Table 1.

      Table 1 Metrics used to calculate genetic diversity

      In addition to these numeric genetic diversity measures, RepeatAnalyzer calculates the frequency of each SSR in the region (by the number of genotypes in which it occurs) and lists the set of SSRs that are unique to the given region. Plots can (optionally) be produced of the distribution of repeat frequencies, repeat lengths or genotype lengths.

    2. ii)

      Genotype analysis, the second type of analysis RepeatAnalyzer provides, is a detailed summary of the compiled information about a single SSR, genotype or location. As an example, the compiled data for A. marginale SSRs in Mexico would include all genotypes found in Mexico, all SSRs in those genotypes, and all papers referencing these genotypes in Mexico or elsewhere in the world. The analysis for a particular genotype includes: a list of all the repeats the genotype contains, the edit distances (the number of amino acid differences between two repeats) between the repeats in the genotype and the mean edit distance; a list of all locations where the genotype has been reported; and a list of all papers reporting a strain with the genotype. For an SSR of interest, RepeatAnalyzer prints a summary of all locations where the repeat has been found, the genotypes in which the repeat is found and the sources in which the findings were reported as well as any SSRs within a given edit distance from that repeat’s sequence and any other SSRs of which it is a subset.

      The analysis functionality of RepeatAnalyzer is summarized in the C branch of Fig. 1.

  1. D)

    Search. While similar to the genotype analysis functionality, search returns a more succinct summary of information on a large number of repeats and/or strains at once, and it maps these results geographically. For a genotype of interest, a search could be performed on the genotype and/or a repeat present in the genotype. The search produces a map of every location where the genotype has been found as well as every instance of its composite repeats. Alternately, a set of similar genotypes could be searched to show where each of them occurs, or all genotypes of a region could be mapped. Examples of this functionality are shown in the Results section.

The search functionality is depicted in the D branch in Fig. 1. The input here can include one or more strains, one or more repeats, and a region. It is also possible to look for all strains or all repeats in a given location or all instances of a given strain or repeats across all geographic regions. The program processes this query and the user may choose to include a map plot of their search as well. The plot shows the locations of all the searched strains and repeats over a world map that can be zoomed to any region of interest. These maps can be saved in most popular image formats (for high resolution images the PDF or SVG formats are recommended) and the data printout can be copy-pasted into a word processor or spreadsheet.

  1. E)

    Summarise All Current Data. RepeatAnalyzer can also print a summary of all the repeats and strains it is currently storing, along with all the locations where these have been recorded and the sources. The printout is stored in a text file in the same folder as the executable file. If a strain or repeat has multiple names, the names are separated by semicolons. This simple functionality is shown in the E branch of Fig. 1.

Data compilation

To build, test and verify RepeatAnalyzer, we used A. marginale SSRs and genotypes we compiled by mining the literature. Specifically, we conducted a review of the literature examining papers that reported new SSRs or repeat sequences for genotyping. A total of 25 papers were reviewed and 21 were found to have genotype and/or repeat data [6, 3150].

We found 16 cases in the literature [32, 3537, 40, 45, 47, 50] where the same nucleic acid sequence was assigned a new name independently. When this occurred, we counted it as a single repeat with multiple names. Similarly, we counted unique genotypes only where a distinct sequence of repeats existed and thus the total number of unique genotypes we identified is significantly lower than the total number of strains reported. A table of all unique repeat sequences and all their published names is available in Additional file 1. We designed our software to account for the potential problem of multiple naming of SSRs by alerting users to existing SSRs with a given name before they are input into the program and allowing them to be renamed.

Results

In this section we demonstrate various functionalities of RepeatAnalyzer using (a) the A. marginale msp1α data we collected to validate RepeatAnalyzer’s functionalities and (b) some preliminary S. pneumoniae PspA SSRs and genotypes we identified. First, we show RepeatAnalyzer’s correctness on a large number of inputs and outline several types of errors we found in the literature in the process that the storage and management features of RepeatAnalyzer could prevent in the future. Next, we show several examples that illustrate the mapping and analytic capabilities of RepeatAnalyzer. Finally, we identify some preliminary PspA SSRs and genotypes and compare our genotyping strategy to one used in the S. pneumoniae literature.

Validation

In order to validate RepeatAnalyzer’s identification functionality we downloaded 380 published A. marginale msp1α sequences [31, 32, 36, 39, 42] from GenBank [51]. Using RepeatAnalyzer we compared the sequence of the repeats in GenBank against the sequence the isolates were reported as having in the corresponding publication. Of the 380 sequences examined, the results returned by RepeatAnalyzer corresponded with the literature in 369 cases. For the remaining 11 cases, manual curation revealed that the information reported by RepeatAnalyzer was correct in all of the cases, assuming the GenBank data is correct. Of the 11 erroneous cases, only five presented unique errors. In addition to these, RepeatAnalyzer had previously corrected five instances of a single name being assigned to two different repeats, which would have otherwise resulted in mismatches. For these 10 distinct errors, by comparing the published repeat sequence to the sequence in GenBank, we were able to classify the errors into two broad categories as i) mistaken repeat sequence and ii) mistaken repeat name.

  1. i)

    Mistaken Repeat Sequence. This error occurs when the published sequence of an SSR is different from the actual SSR sequence present in GenBank. Often the difference between the two sequences is small. This type of mistake is easy to make when dealing with a large number of sequences by hand, and it can lead to major confusion when a genotype is effectively mischaracterized as consisting of the wrong series of repeats. RepeatAnalyzer can prevent such errors by allowing researchers to simply enter the sequence of interest and have a known repeat returned with its name(s), avoiding the need to tediously compare each amino acid one by one. In most cases, we found that the mistaken repeat sequence is actually a new SSR. In such cases, we renamed the sequences by adding a “-2” to the end of the reported name. Instances of the mistaken repeat sequence we found are summarized in the top part of Table 2.

    Table 2 Mistaken repeat sequence and mistaken repeat name found in the A. marginale literature
  2. ii)

    Mistaken Repeat Name. This error occurs when the SSR is reported as corresponding to a previously published SSR, but actually corresponds to a different published SSR. While it is relatively easy to avoid, a mistake like this can be nearly impossible to notice without reanalysing the original sample. In the sequences we examined, we found fewer cases of mistaken repeat names than of mistaken repeat sequences. RepeatAnalyzer could also prevent mistaken repeat name errors, since the set of SSRs in a sequence is returned by their names, removing the need for any manual checking. Instances of the mistaken repeat name we found are summarized in the bottom part of Table 2.

Examples using A. marginale data

Because there is already a large collection of A. marginale SSRs (over 235) and genotypes (over 350) available, we frame the following examples around the A. marginale data we collected. But the same analysis can easily be applied to SSRs and genotypes for any prokaryote that has distinct SSRs. In a subsequent section we will give an example of another species for which SSR genotyping could provide beneficial insights and for which RepeatAnalyzer’s functionalities work seamlessly.

Regional diversity analysis

RepeatAnalyzer has several ways to characterize the diversity within a given geographical region. First, it can easily compute the frequency, within known genotypes, of each SSR that occurs in a region. Second, it can show the list of SSRs unique to a given region (by unique we mean that they have not been reported in an isolate from any other area). RepeatAnalyzer can also produce plots of distributions of SSR frequency, SSR length and genotype length. Finally, it can produce a variety of diversity scores that characterize SSR frequency more succinctly.

As an example, we include the diversity scores for two provinces of Mexico, Nayarit and Jalisco, and for Kansas, USA in Table 3. The table also includes worldwide scores for comparison. Based on GDM1-Local scores, we can see that Kansas has relatively few unique SSRs per genotype, compared to both the Mexican provinces and the world, while Jalisco has relatively diverse SSRs within its genotypes. Similarly, Kansas has low GDM1-Global value, meaning low SSR diversity in general, while Jalisco has a higher value for the same metric. For both GDM2 metrics, the world values are the lowest, which is expected as these values contain all known repeats, most of which occur only a few times. The Kansas scores, both in local and global terms, are relatively high meaning that, both within each genotype and in general, genotypes contain a more uneven mix of SSRs (i.e. many of one repeat and few of several others). In contrast, the two Mexican provinces have lower GDM2 scores, indicating that the repeats are more evenly distributed. Both Mexican provinces were found to have repeats unique to them: for Jalisco LJ2 is unique and for Nayarit EV1, EV5, EV9, and EV10 are unique. Kansas has fewer SSRs in general, and none are unique.

Table 3 Diversity scores for Nayarit and Jalisco, Mexico, Kansas and world data

The plots RepeatAnalyzer produces for the SSR frequency and genotype length distributions for Jalisco and the world data are included in Fig. 3. We found that generally genotype lengths are normally distributed around a mean for any given region, though the mean varies, while frequencies follow a power-law resembling distribution. Further, our analysis of SSR frequency distribution plots for various regions (data not shown) showed that over 90 % of repeats are 28 to 29 amino acids long, in agreement with what is commonly believed by researchers. In addition, we found that repeats of length 28 are approximately 0.35 times as likely as repeats of length 29 (with a standard deviation of 0.15). This is consistent with the global average ratio, where repeats of length 28 occur 0.34 times as often as repeats of length 29.

Fig. 3
figure 3

Repeat frequency and genotype length distributions. The four plots in the figure are histograms produced by RepeatAnalyzer for Jalisco, Mexico and whole world data. Plots a and c show the distribution of SSR frequencies by the number of genotypes in which they occur in the region. Plots b and d show distributions of genotype lengths. The Inset in figure c is zoomed in to show its middle segment in finer detail; RepeatAnalyzer automatically generates this type of inset when outlier values would make the indices on the table difficult to interpret

Genotype analysis

In addition to characterizing the level of diversity in a geographical region, RepeatAnalyzer can also be used to compare and analyse individual entities (either genotypes or SSRs). As an example, we discuss results we obtained for genotype α β β Γ, originally reported in Mexico. The analysis revealed that the genotype has been reported in Minas Gerais, Brazil and various locations around Mexico and that the average edit distance (found to be 8) between its SSR sequences is fairly high. For each repeat in the genotype, the analysis also showed every reported genotype containing that repeat and where it was found. Many of the locations where these repeats appear are similar, including a variety of provinces around Mexico, Brazil and Venezuela, in addition to Taiwan. However β and Γ are also reported in the Philippines and Γ is reported in Italy.

Further, when the investigator enters the desired edit distance, genotype analysis can identify repeats with one or more amino acid substitutions, in addition to the known SSR with the furthest edit distance. A summary of the results of such an analysis for the repeats in genotype α β β Γ are included in Table 4 together with results for other common SSRs for comparison. In this example, we searched with a maximum edit distance of three. Interestingly, α is found to have only two recorded repeats within edit distance 3, with repeat 108 at edit distance 1 and repeat Ph1 at edit distance 2, while β has more SSRs at each edit distance. Both α and β have significantly fewer similar SSRs (with an edit distance less than 3) than the other repeats analysed, including E and 27 which were selected because they have been reported in a similar number of regions as α and β. When examining the relationship of the repeats to each other, the simplest change would be a single amino acid change from the repeat being examined. Ergo, one might expect a given repeat to have the most common edit distance as 1 when comparing to other repeats. However, analysis of edit distances shows that the most common edit distance for the 235 A. marginale repeat data set is 5 or 6, with a slightly skewed normal distribution around this point (the distribution is skewed in that it has a longer tail to the right of the highest point). We posit that this is result of long evolutionary time, where multiple amino acid substitutions have accumulated and been selected for.

Table 4 Edit distance analysis results for some common repeats

Geographic distribution visualization

RepeatAnalyzer can also produce publication-quality maps of genotype and/or repeat occurrences over a geographic area. These maps can show any subset of repeats or genotypes and can be filtered by region. If a region filter is applied, the map includes only results that occur in the given region but shows where else they are found in the world. A map can be zoomed to any region of interest, and high-resolution versions can be saved at any zoom level. Figures 4 and 5 show maps resulting from two example queries. The scale of markers on the map can be set from 50 % of the default size up to 300 % of the default size as needed. Figure 5 has been zoomed in to show Mexico, and the full world version is available in Additional file 2.

Fig. 4
figure 4

Geographic visualization of repeats. The figure shows the output of the query: Repeats: 10; 11; 12; 13; 14; 15; B; C; α; β; Γ, Strains: None, Location: Any, Scale: 1. A version of this same map zoomed in to show Venezuela in detail is available in Additional file 2. The size of a circle indicates the scope of the region it denotes. This is necessary because while a location for a genotype must include a country, it may also (optionally) include a province and/or county. In these cases, the larger the circle is, the broader the scope; country only markers have the largest circles, while markers for a specific county are the smallest

Fig. 5
figure 5

Geographic visualization of repeats from Nayarit, Mexico. The figure shows the output of the query: Repeats: α; β; Γ; EV1; EV3; EV7; EV6, Strains: EV1 β β β Γ; α β β β Γ; EV1 β β Γ; EV3 EV7 β β EV6, Location: Nayarit, Mexico, Scale: 1.5. A version of this same map zoomed out to show the whole world is available in Additional file 2. Circles with grey outlines represent the positions of whole genotypes, rather than individual SSRs. Size still indicates the scope of the region represented, though genotype markers are strictly larger than SSR markers to allow both to be visible simultaneously at the same coordinate location

Preliminary analysis with S. pneumoniae data

To illustrate RepeatAnalyzer’s flexibility with respect to subject species, we collected S. pneumoniae pspA gene sequences from GenBank and genotyped the strains using RepeatAnalyzer. From the five sequences we downloaded, we identified 21 unique repeat sequences, each containing 20 to 21 amino acids. These SSR sequences are available in Additional file 3. The genes we examined contained 9 to 16 tandem SSRs. We found that each genotype was distinct (Table 5). Given the length and variability of repeats present in the pspA gene in virulent strains of S. pneumoniae, we suspect PspA SSR genotypes could be used to differentiate strains with a much finer discrimination than serotyping.

Table 5 SSR genotypes of S. pneumoniae pspA

Discussion

We developed RepeatAnalyzer, a program for tracking, managing, analysing and cataloguing SSRs and genotypes. RepeatAnalyzer automates naming of new SSRs to avoid the most common types of errors found in analyzing this type of data, and provides new metrics for SSR analysis. As mentioned briefly in the Implementation section, the metrics GD2 and GD2b shown in Table 1 are dependent on the length of genotypes in a region. For example, consider these two cases. Case 1: each genotype in a region has no repeating SSRs, say the strains found there are A B C, D E F and H I J. Case 2: a similar region with strains A B C D E F, H I J L M N, and O P Q R S T. In Case 1, the GD2 value is 200, where as in Case 2 it is 600. Intuitively, both cases have the maximum diversity possible, however increasing the length of the genotypes increases the diversity score, making values difficult to compare between areas with varying genotype lengths. GD2 captures a kind of diversity where more SSRs in the study population will raise the diversity score, while more strains with similar genotypes will decrease the score. In contrast, the GDM1 scores are more focused. GDM1-Global is the average percent of unique repeats across all genotypes in the region and GDM1-Local is the average percent of unique repeats in each genotype. For Case 1 in our example, both the GDM1-Global and GDM1-Local scores would be 1, or 100 %, indicating the maximum possible diversity is present. We argue these values are easier to interpret and are more meaningful. The second type of metric, GDM2 measures how often each repeat in a region occurs, and thus captures a completely different formulation of diversity. For Case 2 of our example, the GDM2-Global and GDM2-Local diversity measures are both 0. This tells us that both in each genotype and in general, each repeat occurs the same number of times.

The validation method we chose to use for RepeatAnalyzer, though it allowed us to test our identification functionality in a large set of test cases, relies on the assumption that GenBank data is correct. In particular, if a DNA sequence is reported correctly in a paper and uploaded incorrectly to GenBank, our validation method would mark that as an error. This can be viewed as a limitation. However, due to the rarity of a mismatch between GenBank and the literature we chose to trust the veracity of GenBank data.

When compiling A. marginale data, we found that the naming standards for strains vary considerably across sources. We found some papers [33, 45] that opted against naming strains at all, perhaps due to the uncertainty faced on how best to name strains. A naming scheme containing the location of the strain and the genotype was proposed by Cabezas-Cruz [32]. However, while an improvement, this nomenclature system leaves out year, which nonetheless might be an important piece of information in the future, for example, to determine whether the same genotype is occurring year after year in a particular region. Motivated by this, we propose a new naming convention that is short, information-rich and can be produced directly by RepeatAnalyzer: [genotype]_[country code,province code]_[year]_[animal id]. As an example, using this naming convention, a previously unnamed strain from Kansas would be named 6(D)E_US,KS_2004_8416 [48]. Adding animal id allows someone to quickly realize that in the 2004 Kansas study the 6(D)E genotype occurred seven times.

Conclusion

In summary, we have developed a new software tool for storing, managing, identifying and analysing short-sequence repeats for the purpose of strain identification. Our software can take a gene sequence and return the repeats it contains along with the known strain (if any) that the sequence belongs to. It does so by storing data distilled from sources on repeats at a given SSR locus. The data can be updated simply and searched easily for information about any known strains or repeats. All of these tasks are done in a computationally efficient manner using the KMP string matching algorithm and general programming best practices. RepeatAnalyzer can also produce a map for any combination of repeats and strains in a given region, offering geographic insights into their distribution not previously available. In addition, it can calculate metrics of diversity within geographic regions. We intend to maintain a periodically updated version of the A. marginale data that researchers can download and make contributions to.