A Complete and Accurate Ab Initio Repeat Finding Algorithm
Abstract
It has become clear that repetitive sequences have played multiple roles in eukaryotic genome evolution, including increasing genetic diversity through mutation, changing gene expression, and facilitating the generation of novel genes. However, identifying repetitive elements ab initio can be difficult. Several classical ab initio repeat-finding tools have already been presented and compared, but their completeness and accuracy in detecting repeats are rather poor. To this end, we propose a new ab initio repeat-finding tool, named HashRepeatFinder, which is based on a hash index and word counting. Furthermore, we compared the performance of HashRepeatFinder with that of two other well-known tools, RepeatScout and Repeatfinder, on the human genome assembly hg19. The results support the following three conclusions: (1) the completeness of HashRepeatFinder is the best among the three tools on almost all chromosomes, especially on chr9 (8 times that of RepeatScout and 10 times that of Repeatfinder); (2) in detecting large repeats, HashRepeatFinder also performed best on all chromosomes, especially on chr3 (24 times that of RepeatScout and 250 times that of Repeatfinder) and chr19 (12 times that of RepeatScout and 60 times that of Repeatfinder); (3) in terms of accuracy, HashRepeatFinder can merge the abundant repeats with high accuracy.
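The abstract names a hash index and word counting as the basis of the tool but gives no details. As a rough illustration of that general idea only, and not the authors' HashRepeatFinder algorithm, the Python sketch below indexes every fixed-length word (k-mer) of a sequence in a hash table and reports words that occur more than once. The function names, the word length `k`, and the `min_count` threshold are all hypothetical choices made for this example.

```python
from collections import defaultdict


def build_word_index(sequence, k):
    """Map each k-length word to the list of start positions where it occurs.

    Hypothetical helper: illustrates hash-based word indexing in general,
    not the indexing scheme used by HashRepeatFinder.
    """
    index = defaultdict(list)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if "N" in word:  # skip words that contain ambiguous bases
            continue
        index[word].append(i)
    return index


def repeated_words(index, min_count=2):
    """Keep only words seen at least min_count times (assumed threshold)."""
    return {word: pos for word, pos in index.items() if len(pos) >= min_count}


if __name__ == "__main__":
    idx = build_word_index("ACGTACGTTT", k=4)
    print(repeated_words(idx))  # {'ACGT': [0, 4]}
```

Words that occur repeatedly could then serve as seeds for extending and merging candidate repeats; how HashRepeatFinder actually does this is not described in the abstract.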
Keywords
Interspersed repeats, Tandem repeats, Repeat finder
Notes
Acknowledgments
This work was supported by the Key Scientific Research Project of the Education Department of Henan Province (Nos. 15A510010 and 15A510011) and the Doctoral Scientific Research Start-up Funds of Xinyang Normal University (No. 0201447).
Compliance with Ethical Standards
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.