A Complete and Accurate Ab Initio Repeat Finding Algorithm

  • Shuaibin Lian
  • Xinwu Chen
  • Peng Wang
  • Xiaoli Zhang
  • Xianhua Dai
Original Research Article

Abstract

It has become clear that repetitive sequences have played multiple roles in eukaryotic genome evolution, including increasing genetic diversity through mutation, changing gene expression, and facilitating the generation of novel genes. However, identifying repetitive elements ab initio can be difficult. Several classical ab initio repeat-finding tools have already been presented and compared, but their completeness and accuracy in detecting repeats are somewhat poor. To this end, we propose a new ab initio repeat-finding tool, named HashRepeatFinder, which is based on a hash index and word counting. Furthermore, we compared the performance of HashRepeatFinder with that of two other well-known tools, RepeatScout and Repeatfinder, on the human genome data hg19. The results support three conclusions: (1) the completeness of HashRepeatFinder is the best of the three tools on almost all chromosomes, especially chr9 (8 times that of RepeatScout and 10 times that of Repeatfinder); (2) in detecting large repeats, HashRepeatFinder also performed best on all chromosomes, especially chr3 (24 times that of RepeatScout and 250 times that of Repeatfinder) and chr19 (12 times that of RepeatScout and 60 times that of Repeatfinder); (3) in terms of accuracy, HashRepeatFinder can merge abundant repeats with high accuracy.
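
The abstract describes the tool only as being based on a hash index and word counting. As a rough illustration of that general idea, and not of HashRepeatFinder's actual implementation, the following Python sketch indexes every fixed-length word (k-mer) of a sequence in a hash table and reports the words that occur more than once. The function names, the word length `k`, and the `min_count` threshold are hypothetical choices made for this example.

```python
from collections import defaultdict


def build_word_index(sequence, k):
    """Hash index: map each length-k word to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


def repeated_words(sequence, k, min_count=2):
    """Word counting: keep only words seen at least min_count times.

    Illustrative only; the thresholds and any later merging of
    neighbouring words into longer repeats are assumptions, not
    details taken from the paper.
    """
    index = build_word_index(sequence.upper(), k)
    return {word: positions
            for word, positions in index.items()
            if len(positions) >= min_count}


if __name__ == "__main__":
    # "ACGT" starts at positions 0 and 4; every other 4-mer is unique.
    print(repeated_words("ACGTACGTTT", 4))
```

A real repeat finder would still need to extend or merge such seed words into full-length repeat copies, which this sketch does not attempt.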

Keywords

Interspersed repeats; Tandem repeats; Repeat finder

Notes

Acknowledgments

This work was supported by the key scientific research project of the Education Department of Henan Province (Nos. 15A510010 and 15A510011) and the doctoral scientific research start-up funds of Xinyang Normal University (No. 0201447).

Compliance with Ethical Standards

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Copyright information

© International Association of Scientists in the Interdisciplinary Areas and Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Shuaibin Lian
    • 1
  • Xinwu Chen
    • 1
  • Peng Wang
    • 1
  • Xiaoli Zhang
    • 1
  • Xianhua Dai
    • 2
  1. School of Physics and Electronic Engineering, Xinyang Normal University, Xinyang City, People’s Republic of China
  2. School of Information Science and Technology, Sun Yat-Sen University, Guangzhou City, People’s Republic of China