Background

Large proportions of eukaryotic genomes are essentially composed of repeated sequences, including the human (approximately 45 to 78% [1, 2]), maize (approximately 80% [3]), and salamander (approximately 50% [4]) genomes. Among these repeated sequences, transposable elements (TEs) represent the most significant contributors in terms of sequence coverage and therefore have a major influence on genome evolution, especially on genome size [5]. In contrast to other repeated sequences, TEs consist of a wide diversity of sequences; in addition to the separation in classes based on the transposition intermediate (RNA versus DNA), many subfamilies are described inside each class, corresponding to elements with particular sequence features, and many efforts were made to unify the classification system for all of these elements [6, 7].

With the ever-growing number of whole genome sequencing projects, the identification of TEs becomes necessary to fully characterize the evolutionary dynamics of genomes. Different methods of TE identification have been developed during the past 15 years, with the majority designed to determine TE content in assembled genome sequences produced by the classic Sanger sequencing method (for reviews, see Bergman and Quesneville [8], Saha et al.[9], and Lerat [10]). These methods group three main types of approaches to recover TE sequences: homology-based approaches that search for a reference sequence in a query genome; structure-based approaches that search for particular structural features of certain TE classes, such as the presence of two long terminal repeats (LTRs) at the extremities of LTR-retrotransposons; and de novo approaches that principally use the repetitive nature of TEs to discover them.

More recently, with the emergence of next generation sequencing (NGS) technologies, new efforts were made to develop novel tools to detect TEs because previous methods are not directly applicable to reads produced by NGS data [11, 12]. However, one of the most commonly used procedures to find occurrences of known TEs remains the homology-based method proposed by the RepeatMasker program [13] because it is easy to use, rapid, and efficient [14, 15]. The main drawback of this program is its dependence on reference sequences and consequent inability to discover new TEs. This method however remains a must for identifying TE sequences in an assembly or after the identification of new consensus TE sequences using de novo methods. For example, this last approach (de novo TE libraries used with RepeatMasker) was applied for the identification of TEs in the 12 Drosophila genomes [16].

The principle of RepeatMasker is to search for the occurrence of any reference sequence contained in a library (currently Dfam [17] and RepBase [18], or user-built in) in a query sequence using a sequence comparison approach based on popular search engines including nhmmer, cross_match, ABBlast/WUBlast, RMBlast, and Decypher [19]. RepeatMasker generates several output files, including the .out file, which provides a detailed annotation of all detected repeats in the query sequence, specifically including their position, orientation, and divergence from the reference sequence [19]. This .out file is particularly useful because it identifies the part of the query sequence that matches a given TE family of a library (a ‘hit’) and provides its position in the query sequence for each one. However, a remaining challenge consists of identifying the different copies of elements corresponding to those ‘hits’ , which is a prerequisite for any evolutionary or comparative analysis of different copies of a family.

Some scenarios in particular can lead to multiple hits corresponding to a unique copy of an element. The first scenario, in the case of a LTR-retrotransposon, comes from the split of its consensus into a sequence corresponding to the LTR and a sequence corresponding to the internal portion of the element (Figure 1A). This separate annotation for LTR-retrotransposons is supported to facilitate the identification of solo-LTRs, which may be numerous in some genomes [20]. Multiple hits corresponding to only one copy of a given element can also result from large deletions (Figure 1B) or insertions that occur in sequences and disrupt the entire copy, leading to nested TEs (Figure 1C). Moreover, the presence of undetermined bases, which may occur due to low sequencing quality, could also disrupt unique sequences corresponding to a copy and give multiple hits. Taken together, these characteristics induce multiple hits corresponding to a unique copy for a given TE in the RepeatMasker .out file. Finally, non-significant hits can be present in the output file, in addition to sequences that do not fit the 80-80-80 rule [6], that is, sequences that would align with the reference on less than 80 bp, on less than 80% of their respective length, and with less than 80% of identity.

Figure 1
figure 1

Examples of multiple hits in the RepeatMasker . out file corresponding to one copy. (A) Multiple hits due to separation between the LTR and internal regions in the consensus sequence in the Repeat Library for the LTR-retrotransposon Tirant in D. melanogaster. Three hits are obtained with RepeatMasker, which correspond to one complete copy. (B) Two large deletions lead to the identification of three hits in RepeatMasker corresponding to one incomplete copy of the Helena element in D. melanogaster. (C) Example of one TE inserted into another. The Quasimodo LTR-retrotransposon is inserted in the reverse orientation by a hobo element, which is incomplete due to an internal deletion. While six hits are proposed by RepeatMasker, they in fact correspond to two copies, one for each element. LTR, long terminal repeat; TE, transposable element.

Some programs proposing the use of RepeatMasker output files were developed [2123], but none allow access to the location of all of the TE copies or an accurate quantification of the family content at the genomic level. These programs usually have very specific aims. TSDFinder was developed to refine the coordinates of long interspersed nuclear element (LINE) L1 insertions by identifying flanking target site duplication (TSD) sequences and the poly(A) tails of 3′ intact L1 insertions in the human genome [21]. The LTR-miner program was designed to specifically retrieve information concerning the age and distribution of LTR-retrotransposons [22]. This program was then implemented in the Reannotate program for use on all categories of TEs to estimate the temporal order of insertions in the case of nested elements and to estimate the age of LTR-retrotransposon copies [23].

In this manuscript, we propose a perl tool (available at http://doua.prabi.fr/software/one-code-to-find-them-all) that parses the RepeatMasker .out files to accurately determine the number of TE copies found, obtain their positions, and retrieve their sequences. This tool should be helpful for any non-bioinformatics scientist interested in genome annotation and/or evolution. To our knowledge, this program is the first multi-purpose tool that correctly identifies TE copies using RepeatMasker and provides complementary quantitative information for individual families in a query sequence.

Implementation

The proposed tool consists of two perl scripts that must be run successively to take into account the different characteristics of the consensus sequences.

Script 1: build_dictionary.pl

This script builds a list of all of the LTR-retrotransposons found in the query sequence at least once by RepeatMasker to associate hits corresponding to the internal portion and those corresponding to LTR sequences. This module uses the RepeatMasker .out file or a directory path containing several RepeatMasker .out files as input. RepeatMasker files are recognized based on their .out extension, allowing the program to run recursively on large file structures without prior file sorting (for example, working on one organism by running the program on an entire directory downloaded from a genome database). Then, the program matches together internal and LTR portions, based on name similarity. The main issue with this step relies on heterogeneity in the annotation of LTR-retrotransposons in the library. For example, the majority of LTR-retrotransposons in Drosophila melanogaster appear under the name ‘TE_LTR’ and ‘TE_I’ for the LTR sequence and the internal sequence, respectively. However, the members ‘LTR’ and ‘internal portion’ may sometimes have different names. This scenario is, for example, the case for the LTR-retrotransposon HMS-beagle, for which the corresponding LTR sequence is labelled DMLTR5, while the internal portion is labelled HMSBEAGLE_I. The same problem occurs more frequently for retrotransposons in Homo sapiens, making it difficult to derive a completely generalized algorithm to determine the concordance between the LTR and internal portions. These issues imply that the output file of build_dictionary.pl must be manually inspected to correct for any mis-association.

In the standard version (see --fuzzy option for the alternative version), the program only recognizes similar names in addition to the ‘LTR’ or ‘int’ suffix or prefix, taking into account small discrepancies such as a ‘-’ symbol replaced by a ‘_’ symbol, for example, recognizing the association between HERV-Fc2-int and HERV-Fc2_LTR.

The following parameter must be provided in the program:

  • --rm infile (corresponds to a RepeatMasker .out file or the name of a directory containing several RepeatMasker .out files).

Three options can be specified by users:

  • --help

    This option prints a summary of the different usages of the script.

  • --fuzzy

    This option allows the script to associate more LTR names with internal counterparts to account for the possibility of LTR variants. In three successive passes, the program associates similar names differing by a single letter, a single number, or two characters. For example, in the human genome, the --fuzzy option allows for the association of MER66-int with its various counterparts MER66A, MER66B, MER66C, and MER66D or HERV1_I-int with HERV1_LTRa, HERV1_LTRb, HERV1_LTRc, HERV1_LTRd, and HERV1_LTRe.

  • --unknown

    To be used in particular cases where the RepeatMasker program was run using a local TE library without the class/subclass specification (see below).

Finally, the name and path of the output file should be specified using a redirection (>dictionary_output.txt). Examples of command lines are detailed in the tutorial available on the program website.

Script 2: one_code_to_find_them_all.pl

The second script uses the output file produced by build_dictionary.pl and a RepeatMasker .out file (or a directory containing several RepeatMasker .out files). The principle of this program is to compare the positions and orientation of each hit corresponding to the same TE family to determine whether the hits correspond to the same copy and can be merged or correspond to different copies. Two hits located on the same scaffold or chromosome are considered to be fragments of the same copy if they abide by the three following conditions: 1) they have the same orientation; 2) the extremities of the fragments respect a distance criterion: by default the furthest extremities should be separated by less than twice the length of the reference TE element (see the --insert option for non-default behavior); and 3) the second fragment starts and ends after the first one respectively starts and ends (that is, the two fragments can overlap but cannot be included in one another). These constraint filters were motivated by a conservative choice, meaning not to merge copies that do not belong to the same insertion. However, one shortcoming of this methodology is that it may be impossible to re-assemble old copies in which many insertions of other elements had taken place after this copy was first inserted in the genome. Moreover, we may over-estimate the copy number if a portion of a given copy is inverted, leading to several fragments in different orientations.

The identification of unique copies of LTR-retrotransposons depends on the different fragments and different portions of the element (LTR and internal portions), as follows. First, we identify different fragments of the same portion that could be later assembled as a copy. For that purpose, two LTR fragments must not be separated by a compatible internal fragment, and two internal fragments must not be separated by a compatible LTR fragment. These steps are necessary for the merging of fragments into a copy. Once all copies are reconstructed from the RepeatMasker hits, the program assembles full-length LTR-retrotransposons by associating LTR copies and their corresponding internal copy located closely to one another. Conditions for associating a LTR sequence with an internal sequence include the following: the LTR sequence must be in the same orientation as the internal sequence, and it must be separated from the internal sequence by less than half the LTR length. The reconstruction of full-length ‘LTR-I-LTR’ elements is performed as a priority, and with the remaining copies, incomplete ‘LTR-I’ or ‘I-LTR’ elements are then built. All copies, assembled or solo, are reported. As solo-LTRs are of special evolutionary interest, they are reported separately from the full-length and partial LTR-retrotransposon copies in the summary file .copynumber.csv (see below).

The parameters required by the program include the following:

  • --rm infile (corresponds to a RepeatMasker .out file or the name of a directory containing several RepeatMasker .out files).

  • --ltr output file from build_dictionary.pl (Script 1).

Several options can be specified by users:

  • --help

    This option prints the possible usages of the script.

  • --strict

    This option makes the program use a rule based on the 80-80-80 rule [5] to select hits. In this case, the program provides copies with sizes greater than 80 bp long and which have greater than 80% identity to the reference element. By default, the program gives all hits found, regardless of the size or percentage of identity compared with the reference.

  • --length ‘length_file’

    This option allows users to work with their own file for the length of the reference elements, which will be used to determine the ratio of the length of a given copy compared with its reference. If not provided, the code computes the length of all elements (LTR and internal portions separated for the LTR-retrotransposons) present in all .out files under study, by selecting for each element the most common consensus length (as in some cases multiple RepeatMasker consensus sequences can correspond to the same element). This option is valuable when working with elements whose annotation is ambiguous to ensure that the correct reference length is used. It can also be used with another purpose, when only a subset of TEs is considered, since only the elements mentioned in the .length file will be taken into account.

  • --choice

    This option allows users to manually resolve ambiguous situations by choosing their favorite solution for merging hits. For example, Figure 2A shows a case in which two choices are possible, that is, two different hits can be assembled with the one under study (DM297_I at position 21,407,284 on the chromosome X). In this case, the first choice (solution 0) is the most parsimonious. Solution 0 is always the one corresponding to assembling closest hits together. However, this solution may come to a fault in the case of multiple nested or duplicated TEs corresponding to the same reference element. For example, in Figure 2B, solution 1 is the most parsimonious, that is, the one that minimizes the reorganization of the copy compared to the reference element structure. If this option is not specified, the default choice consists of choosing solution 0.

    As many ambiguous cases can arise, the RepeatMasker block ID (column 14 of RepeatMasker .out file) is used when this option is activated. These IDs come from the ProcessRepeats script implemented in RepeatMasker, which makes educated guesses if any pair of fragments is derived from the same element or not. Therefore, if an ambiguous situation can be solved unequivocally using these Block IDs, no choice is left to the user, and the elements sharing the same Block ID are merged.

    Another way of quickening the choice process is to only ask the user about ambiguous cases, and sometimes a single choice can disambiguate multiple situations. For example, consider the situation for which three fragments A, B, and C are considered for merging, and for which the choice is between A-B and A-C (choices are always pairwise). If the user considers the right choice to be A-B-C, he/she will choose A-B. Then, if adding C to the merged A-B is not ambiguous (if there is no D fragment of the same element nearby to get confused with, for example), the code will directly merge C with A-B, getting the right result A-B-C without asking the user about this final merging.

  • --dry-run

    This option performs all operations, but reports no results except the log file with all operations performed. It is designed to be used in tests, particularly those determining the number of ambiguous situations to be resolved. Running the program with this option before the actual analysis allows estimation of the time required to complete an analysis with the --choice option because the number of ambiguous situations can be high, and manual choice is time-consuming if applied to all elements in a genome.

  • --unknown

    In the particular case in which the RepeatMasker program was run using a local library that did not use the naming system required to differentiate the class and the subclass (required format is described in RepeatMasker help file), the .out file is slightly different because column 11 (repeat class/family) is usually filled with ‘Unknown’ or ‘Unspecified’ , which means that the type of individual TE is not specified. To account for this possibility, the user can use the --unknown option, which will produce results for elements annotated as ‘Unknown’ or ‘Unspecified’ and deriving from the local, unannotated bank.

  • --fasta and --flanking ‘size_in_bp’

    The --fasta option allows for the retrieving of sequences of copies reported by the program from the local fasta sequence files used in the RepeatMasker program. To study flanking sequences of the determined copies, the --flanking option can be specified to allow the program to report the flanking regions of the specified size surrounding each copy in addition to the TE sequence.

  • --insert ‘size_in_bp’

    This option changes the code behavior for merging fragments into copies. By default, the furthest extremities of the considered fragment to be merged are compared, and merging takes place if they are less than twice the reference element length apart. Using --insert, the size of the genomic sequence between the two closest extremities of the considered fragments (that is, the size of the insertion between them) will be considered: if it is less or equal to the size given in the option, the fragments are merged. For example, using --insert 0 means only fragments detected right next to each other in the query sequence will be considered as parts of the same copy.

By default, five output files are generated, which are located in the same directory as the RepeatMasker .out file(s), plus one output file located in the working directory (.length file) that is produced only if the --length option was not specified.

The .log.txt file contains the screen output of the program. For each element, this file summarizes the number of hits and copies obtained after merging the hits. When the --dry-run option is chosen, it displays the possible choices that would be asked using the --choice option.

The .copynumber.csv file contains quantitative information about each of the identified TE families in the query sequence. This file displays eight columns (see Figure 3A as an example corresponding to some DNA transposons and LTR-retrotransposons detected on the long arm of the chromosome 2 (2L) of D. melanogaster): column 1, Family, category of the given TE (as specified in the column 11 ‘repeat class/family’ of the RepeatMasker output file); column 2, Element, name of the given TE (as specified in the column 10 ‘matching repeat’ of the RepeatMasker output file); column 3, Length, length of the reference TE in bp (information from the consensus sequences, as found in the .length file). In the absence of either the internal or LTR portion of a LTR-retrotransposon in the query files, the column will specify ‘No_ref_available’; column 4, Fragments, number of hits found by RepeatMasker corresponding to a given TE; column 5, Copies, total number of copies reconstructed from the hits (if the --strict option was selected, this number can be null, meaning that none of the fragments passed our 80-80 rule); column 6, Solo_LTR, number of solo-LTRs reconstructed from the hits. The column will specify ‘NA’ for non-LTR elements; column 7, Total_Bp, total number of base pairs corresponding to a given TE for the analyzed query sequence; and column 8, Cover, percent coverage of a given TE in the analyzed query sequence.

Figure 2
figure 2

Two examples of choice as proposed by the program. (A) The best global solution appears as a concatenation of blue, red, and green fragments. Because the program works locally only on a pair of copies, the best choice to start with is 0, assembling blue and red; after this selection, the program will automatically detect that the assembled blue-red could be concatenated with the green fragment and either propose it to the user if there is ambiguity or assemble them together if the case is unambiguous. (B) The best global solution appears to be the concatenation of blue and green (solution 1) rather than blue and red (solution 0).

Figure 3
figure 3

An example of the output file .copynumber.csv. The tabulated format allows easy reading in many software programs commonly used to plot graphs, such as Excel or R. (A) The beginning of the file displays different DNA transposon and LTR-retrotransposon families. Rows starting with ### summarize the information for the above families. (B) The end of the file displays a summary for all of the repeat types. LTR, long terminal repeat.

For each TE category (DNA transposons, LINEs, short interspersed nuclear elements (SINEs), and LTR-retrotransposons), the global information concerning the number of fragments, number of copies, number of base pairs, and percent coverage are given and correspond to lines beginning with ‘######Type:DNA’, ‘######Type:LINE’, ‘######Type:SINE’ , and ‘######Type:LTR’. The column ‘length’ in this case contains a NA. For example, in Figure 3A, the DNA/hAT transposon hobo (reference length of 3,016 bp) has 40 fragments on chromosome 2L corresponding to 21 copies. These copies span 20,529 bp on chromosome 2L, which represents 0.09% of this chromosome. The end of the file gives global information concerning all TEs (and thus the coverage of all TEs on the analyzed sequence), satellites, low complexity regions, simple repeats, and unknown repeat elements (see Figure 3B).

The *.ltr.csv and *.transposons.csv files (see Figure 4 as an example) contain the list of all occurrences of LTR-retrotransposons, and of non-LTRretrotransposons and DNA transposons, respectively, which were identified by the program. In these files, the columns globally correspond to those proposed in the RepeatMasker .out file, with the exception of the sixth and the last two columns. The (left) column of the RepeatMasker file, the sixth one, is replaced with the length of the reconstructed copy, from the consensus point of view (that is, it can be different from the span on the query sequence). The ‘Num_Assembled’ column corresponds to the number of hits assembled into the different copies. The ‘%_of_Ref’ column represents the proportion of length of the reconstructed copy compared to the reference element. This ratio is expected to be 1 if the reconstructed copy is the same length as the reference element. These numbers thus provide information about the integrity and quality of the copies inserted in the genome; that is, for a given family or superfamily, copies which are mostly full-length (ratio close to 1) and with low divergence from the reference, could result from recent insertion events. In the case of solo-LTRs, that is, copies that only correspond to the LTR section of a consensus, the ratio is computed in reference to the length of the LTR sequence. This implies that full-length solo-LTRs will have a ratio of 1.

Figure 4
figure 4

Examples of the output files *.transposons.csv and *.ltr.csv. Lines starting with ### are full copies. Fragments that have been used to assemble the copy are indicated below these lines.

Individual copies of TEs correspond to lines beginning with the # character followed by the identification number of the merged hits with each one separated by a slash (/). For those that have been reconstructed using several hits, the fragments used to build the considered copies are shown below. For each copy, the ‘%_Div’ (percentage divergence to the reference), ‘%_Del’ (percentage of deletion compared to the reference), and ‘%_Ins’ (percentage of insertion compared to the reference) are the means of the values of each fragment normalized by size.

In the example in Figure 4, the first copy corresponds to a single fragment of a Doc element, which is nearly complete, whereas the copy below has been reconstructed using three fragments that also correspond to a Doc element. The third example corresponds to a complete copy of the copia LTR-retrotransposon, which has been reconstructed with respect to the separation of the ‘internal portion’ and ‘LTR’ in the consensus library. The last example corresponds to a full-length solo-LTR of copia2. The position of each copy is provided in columns 5, 6, and 7, which correspond to the name of the query sequence, position of the start of the copy in the query sequence, and position of the end of the copy in the query sequence, respectively. The orientation on the strand (+ or complementary) is specified in column 9. In the example in Figure 4, the reconstructed copy of the copia element is located in the long arm of chromosome 2 (chr2L), starts at position 3,073,087, ends at the position 3,078,231 (is 5,145 bp long), and is on the positive strand (+).

The last output file, .elem_sorted.csv, contains the same information as the two previous ones, but sorted per genomic position and not per element, in order to be easily used by people interested in the genomic context and distribution of TEs.

Results and discussion

To determine the accuracy of the program, we tested it with several RepeatMasker .out files corresponding to two organisms, D. melanogaster and H. sapiens, for which the TE content has already been largely described and which present great differences in genome size, TE content, and TE families.

Test of the D. melanogaster genome RepeatMasker output files

We retrieved the RepeatMasker .out files (thereafter mentioned as UCSC files) from the UCSC Genome Bioinformatics website (http://genome.ucsc.edu/), which were produced using version dm3 (April 2006) of the genome sequence with the 17 May 2007 (open-3.1.8) version of RepeatMasker and library release 20061006. Each file corresponds to a different chromosome (2L, 2LHet, 2R, 2RHet, 3L, 3LHet, 3R, 3RHet, 4, U, Uextra, X, XHet, and YHet). We did not retrieve the file corresponding to the mitochondrial genome. We also retrieved the unique RepeatMasker .out file (thereafter mentioned as RM file) provided for the same genome version on the RepeatMasker website (http://www.repeatmasker.org) using the library release 20080611 and open-3.2.5 version of RepeatMasker. This file contains the results for all chromosomes.

Determining the number of ambiguous cases that may require manual inspection (option --dry-run/--choice)

The option --dry-run was used with the UCSC files to determine the number of ambiguous cases that could be manually expertized. For all chromosomes, 862 cases appeared (see Additional file 1: Table S1 for individual chromosome detail). We investigated the cases corresponding to chromosome 3R for which eight ambiguous cases were identified. For all but two cases, the default solution 0 was the best choice from a biological point of view (minimizing the reorganization of the copy compared to the reference element structure). For the two remaining cases, the best choices were solution 1 and the last solution (not assemble the fragments). For chromosome X for which 14 ambiguous cases were indicated, solution 0 was the best choice in ten cases and the last solution (to let the first fragment alone) was the best choice for four cases. This result indicates that the default choice made by the program is the best choice (the most biologically sound) in the majority of cases.

Running the program with and without the --strict option

We did not initially specify use of the --strict option and successively ran the program with the UCSC and RM files. When the --strict option is not specified, the program considers every hit without filtering using our 80-80 rule. We observed the same amount of TEs globally (both in terms of copy number and chromosome coverage, see Additional file 2: Table S2 and Additional file 3: Table S3) for the two versions of the Repeat Library used with slightly more copies detected in the RM file (208 more copies, see Additional file 2: Table S2). This observation can be explained by the fact that the library used in this case was more recent and thus capable of containing new reference elements. In the results from the UCSC files we observed that the DNAREP1 element was associated with the repeat class family LINE/Penelope, as proposed when it was first described [24], whereas it is now known to correspond to the repeat class family DNA/Helitron[25]. In the annotation from the RM file, the association is correct, indicating that the Repeat Library used by UCSC incorrectly assigned this element to the LINE category, which was later corrected in a new version. We therefore chose to consider only the output file from the RepeatMasker website (RM file) for the rest of the test. This underlines the importance of a correct TE classification to obtain an accurate amount of particular elements.

Table 1 displays the number of copies per chromosome with and without the use of the --strict option. As expected, the global number of copies decreased from 9,134 to 5,656 copies in the euchromatin portion of the genome when the 80-80 rule was applied. This last number is congruent with the 5,409 annotated copies in the D. melanogaster euchromatin in the FlyBase annotation version r5.49 (http://flybase.org) [26]. The results also showed that the copy number in unplaced chromosomes is particularly high, indicating that the euchromatin is far from a complete reflection of the entire genome in terms of TE content. While heterochromatin regions display less TE copies (5,066 copies without the --strict option and 3,451 copies with the --strict option), TEs represent a large coverage of these regions (approximately 60% on average, see Additional file 4: Table S4).

Table 1 Copy number per chromosome for each category of TEs in D. melanogaster

Using the output files *.transposons.csv and *.ltr.csv, which contain details for the copies for each heterochromatin chromosome, we retrieved all of the potentially full-length elements by selecting copies whose ratios compared with the reference were over 95% (%_of_Ref, column 17). We obtained 474 copies corresponding to this criterion, which is more than the 202 full-length elements previously described [27] but that includes 130 full-length solo-LTRs. We did the same to determine the number of potentially full-length elements in euchromatin regions and found a total of 655 elements (1,039 elements when counting the highly represented DNAREP1, which is no more active and full-length solo-LTRs (170 copies)). This number is higher than the 478 full-length elements described with an older version of the D. melanogaster genome, which annotated only 1,572 TE copies [28]. This result demonstrates that our program can quickly identify potentially full-length elements.

In terms of proportion, the global TE content on chromosomes is congruent with what was previously shown [26, 27] with an average of 6.69% (6.04% with the --strict option) of TEs in euchromatin regions (without taking into account chromosome 4) and 61.63% (52.53% with the --strict option) of TEs for heterochromatin regions (see Additional file 4: Table S4).

Another example of what can be directly performed using the outfiles *.transposons.csv and *.ltr.csv is displayed in Figure 5. The divergence of sequences (%_Div, column 2) was plotted against the size ratio for each copy compared with the reference element (%_of_Ref, column 17) for each superfamily in the euchromatin portion of the genome (chromosomes 2L, 2R, 3L, 3R, 4, and X). This procedure can allow the quality of the copies inserted into the genome to be determined quickly; that is, for a given family or superfamily, if the copies are mostly full-length (ratio close to 1) and not divergent from the reference, this could indicate recent insertion events. For example, in Figure 5, the elements from the LTR/Copia superfamily (including the families copia, copia2, FROGGER, and 1731) mainly correspond to highly conserved copies (with a small divergence compared to their reference) with two populations of copies: one corresponding to almost full-length copies (potentially recent insertions) and the other corresponding to short copies. When looking in more detail, the populations of conserved copies of small sizes correspond mainly to copia2 copies but do not represent solo-LTRs (see Additional file 5: Figure S1 for individual representation of copia, copia2, FROGGER, and 1731 families). The same information can be produced for the other LTR-retrotransposon classes (Additional file 6: Figure S2 and Additional file 7: Figure S3 for individual family representations of Gypsy and BEL/Pao elements, respectively). Elements from the LINE/LOA superfamily, which in this case correspond to only one family (the Baggins family), had copies with low divergence compared to the reference but with different sizes, and a few of them were full-length, which could illustrate the same date of activity for the different copies and the transposition mechanism for LINE-like elements, which can be truncated at their 5′ end upon insertion. Thus, globally, we can easily obtain information concerning the population of copies of a given family and their positions in the genome.

Figure 5
figure 5

Plot of the divergences according to the size ratio. The divergence (column %_Div in files *.transposons.csv and *.ltr.csv) of sequences has been plotted against the size ratio of the copy compared to the reference element (column %_of_Ref in files *.transposons.csv and *.ltr.csv) given for each superfamily of LTR-retrotransposons (upper left panel), non-LTR-retrotransposons (lower left panel), and DNA transposons (right panel). Each point corresponds to a copy. Copies with a divergence close to 0 and ratio close to 1 correspond to potentially active and full-length copies. As the divergence increases and ratio decreases, corresponding copies are more degraded. LINE, long interspersed nuclear element; LTR, long terminal repeat.

Test of the tool using the H. sapiens genome RepeatMasker output files

We retrieved the RepeatMasker .out file from the RepeatMasker website (http://www.repeatmasker.org), which was produced using the hg19 version (February 2009) of the genome sequence with the open-3.3.8 version of RepeatMasker and Repeat Library 20120124. This file contains results for all chromosomes, that is, 22 autosomal chromosomes and the two sex chromosomes (X and Y) that we considered in the test. We did not take into account results corresponding to randomly placed sequences, unplaced sequences (chrUn), and particular regions of chromosome 6 (corresponding to different haplotypes of the major histocompatibility complex region), chromosome 4, and chromosome 17.

Determining the number of ambiguous cases that may require manual inspection (option --dry-run/--choice)

We determined the number of ambiguous cases that could be manually expertized for our file. For all of the considered chromosomes, a total of 12,133 possible choices appeared, which could potentially be investigated (see Additional file 8: Table S5 for the number by chromosomes). This large number indicates that complete manual annotation would be impossible to manage; however, by reducing the analysis to some TE families of interest, it would still be possible.

Running the program with and without the --strict option

We ran our program with and without the --strict option. Table 2 displays the percent coverage for each TE class in each chromosome and the two cases. The average coverage for each TE class without the --strict option was congruent with the admitted TE content in the human genome with 3.23% DNA transposons, 19.85% LINEs, 13.16% SINEs, and 8.73% LTR-retrotransposons, representing a total of 44.98% TEs in the genome [1].

Table 2 Percent coverage of each chromosome and for each class of TE in H. sapiens with and without the -- strict option

One original feature of our program is the ability to compute detailed quantitative information chromosome by chromosome, which differs from the output table produced by RepeatMasker. This feature allows us to show that the representation of each TE class differs according to the chromosome. For DNA transposons, chromosomes 3 and 20 displayed the highest proportion of these elements (4.05% and 4.17%, respectively), whereas the Y chromosome is particularly poor in elements of this class with only 0.79%. The X chromosome contains the highest proportion of LINEs and LTR-retrotransposons (33.71% and 11.38%, respectively) with chromosome 22 harboring the lowest proportion of the same elements (10.95% LINEs and 4.64% LTR-retrotransposons). Finally, SINEs are particularly abundant on chromosome 19 (26.98%) and rare on the Y chromosome (4.38%). Globally, the X chromosome has the highest proportion of TEs (58.77%), whereas the Y chromosome has the lowest proportion of TEs (23.84%). This observation is congruent with the discrepancy observed for particular families between the autosomal and sex chromosomes [29].

We examined the base coverage proportion for the most represented TE families in each chromosome (Figure 6). For each chromosome, the most represented LINEs mainly correspond to L1 and then L2 (Figure 6A). The two most represented SINE families include Alu and MIR (Figure 6B). Among the LTR-retrotransposons, the most represented elements correspond to the MaLR families in all chromosomes except chromosomes 19 and Y in which they correspond to the ERV1 families. The ERVL families correspond to the third most represented LTR-retrotransposons in all chromosomes (Figure 6C). Among the DNA transposons, the TcMar_Tigger families are the most represented in all chromosomes with the exception of chromosomes 1 and 2 in which the hAT_Charlie families are the most abundant.

Figure 6
figure 6

Percent coverage of the most represented TE families in the human genome for each chromosome. (A) L1 and L2 elements are the two most represented LINE superfamilies with L1 representing the highest proportion. (B) Alu and MIR elements are the two most represented SINE superfamilies with the Alu representing the highest proportion. (C) ERV1, ERVL, and MaLR are the three most represented LTR-retrotransposon superfamilies. (D) hAT_Charlie and TcMar_Tigger are the most abundant DNA transposon superfamilies. LINE, long interspersed nuclear element; LTR, long terminal repeat; SINE, short interspersed nuclear element; TE, transposable element.

The same global distributions are observed when using the --strict option, which takes into account elements that follow our 80-80 rule. However, the global amount of each class decreases with an average of 25.48% of the genome (Table 2). The elements following this rule are expected to be well-conserved, suggesting that these elements were potentially active until recently. Indeed, the most represented families correspond to those known to have had a recent activity (Table 3) such as LINE L1 and SINE Alu[26]. Among Alu elements, the most represented families correspond to AluJb, AluSz, AluY, AluSx1, and AluSx, which usually represent more than the half of the total Alu s. However, the most represented LTR-retrotransposons correspond to the ERV1 and MaLR families, and only ERVK elements are supposed to remain active [30].

Table 3 Percent coverage of each chromosome of the most represented TE families using the -- strict option

Conclusions

We have developed a tool to conveniently parse the classic RepeatMasker .out file to improve the original annotation provided, by including reconstruction of full-length copies. This information includes in particular a measure of the quality of the copies compared to a reference element, as well as the exact position and orientation of each copy and some quantification concerning their proportion in the genome/chromosome sequence, allowing for a fast and accurate assessment of the exact TE content. Additionally, the sequence of each copy with or without flanking sequences can be retrieved directly, allowing further analyses of the TEs. We hope that this tool will assist non-bioinformatics scientists in the more accurate identification of TE copies.

Availability and requirements

Project name: One code to find them all.

Project home: http://doua.prabi.fr/software/one-code-to-find-them-all.

Operating system(s): Linux/Unix, Mac OS X, Windows (with Perl installed).

Programming language: Perl.

License: GNU General Public License.