BIRD 2008: Bioinformatics Research and Development, pp. 87–101
Searching for Supermaximal Repeats in Large DNA Sequences
Abstract
We study the problem of finding supermaximal repeats in large DNA sequences. For this, we propose an algorithm called SMR, which uses an auxiliary index structure (POL) that is derived from, and replaces, the suffix tree index STTD64 [1]. Our numerous experiments on data from the 24 human chromosomes indicate that SMR outperforms the solution provided as part of the Vmatch [2] software tool. In searching for supermaximal repeats of at least 10 bases, SMR is twice as fast as Vmatch; for a minimum length of 25 bases, SMR is 7 times faster; and for repeats of length at least 200, SMR is about 9 times faster. We also study the cost of POL in terms of time and space requirements.
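To make the problem statement concrete, here is a minimal brute-force sketch of what a supermaximal repeat is. It is not SMR, POL, or the Vmatch method, none of which are described in this abstract. It assumes the usual definition from Gusfield [3]: a supermaximal repeat is a maximal repeat that is not a substring of any other maximal repeat, which is equivalent to a repeated substring that stops repeating as soon as it is extended by one character on either side. The function name `supermaximal_repeats` and the parameter `min_len` are hypothetical, and the cubic memory cost makes this usable only on toy inputs.

```python
from collections import defaultdict


def supermaximal_repeats(seq, min_len=1):
    """Brute-force illustration only (hypothetical helper, not the SMR algorithm).

    Returns the sorted list of substrings of `seq` with length >= min_len that
    occur at least twice (overlaps allowed) and whose one-character left and
    right extensions all occur fewer than twice.
    """
    n = len(seq)

    # Count every substring of length >= min_len.
    counts = defaultdict(int)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            counts[seq[i:j]] += 1

    # Keep only the substrings that repeat.
    repeated = {s for s, c in counts.items() if c >= 2}
    alphabet = set(seq)

    result = []
    for w in repeated:
        # If any one-character extension still repeats, w lies inside a
        # longer repeat and therefore cannot be supermaximal.
        extendable = any(
            (c + w) in repeated or (w + c) in repeated for c in alphabet
        )
        if not extendable:
            result.append(w)
    return sorted(result)


if __name__ == "__main__":
    print(supermaximal_repeats("banana"))  # ['ana']
```

The `min_len` argument mirrors the minimum repeat lengths (10, 25, 200 bases) used in the comparison above; an index-based method such as the one the paper proposes would avoid enumerating substrings altogether.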
Keywords
DNA sequences, supermaximal repeats, suffix tree, performance
References
- 1. Halachev, M., Shiri, N., Thamildurai, A.: Efficient and scalable indexing techniques for biological sequence data. In: Hochreiter, S., Wagner, R. (eds.) BIRD 2007. LNCS (LNBI), vol. 4414, pp. 464–479. Springer, Heidelberg (2007)
- 2. Vmatch: large scale sequence analysis software, http://www.vmatch.de
- 3. Gusfield, D.: Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, New York (1997)
- 4. Korf, B.R.: Human Genetics: A Problem-Based Approach. Blackwell, Boston (2000)
- 5. Watson, J., Hopkins, N., Roberts, J., Steitz, J., Weiner, A.: Molecular Biology of the Gene, 6th edn. Benjamin-Cummings, Menlo Park (2007)
- 6. Kurtz, S.: Reducing the space requirement of suffix trees. Software Practice and Experience 29(13), 1149–1171 (1999)
- 7. Grossi, R., Vitter, J.S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing 35(2), 378–407 (2005)
- 8. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: 41st IEEE Symposium on Foundations of Computer Science, pp. 390–398 (2000)
- 9. Valimaki, N., Gerlach, W., Dixit, K., Makinen, V.: Compressed suffix tree - a basis for genome-scale sequence analysis. Bioinformatics 23(5), 629–630 (2007)
- 10. Hon, W.-K., Lam, T.W., Sung, W.-K., Tse, W.-L., Wong, C.-K., Yiu, S.-M.: Practical Aspects of Compressed Suffix Arrays and FM-index in Searching DNA Sequences. In: 6th Workshop on Algorithm Engineering and Experiments, pp. 31–38 (2004)
- 11. Kurtz, S., Schleiermacher, C.: REPuter: Fast Computation of Maximal Repeats in Complete Genomes. Bioinformatics 15, 426–427 (1999)
- 12. RepeatMatch, http://mummer.sourceforge.net/manual/#repeat
- 13. RepeatMasker, http://repeatmasker.org/
- 14. Bedell, J.A., Korf, I., Gish, W.: MaskerAid: a Performance Enhancement to RepeatMasker. Bioinformatics 16(11), 1040–1041 (2000)
- 15. Gotoh, O.: An Improved Algorithm for Matching Biological Sequences. Journal of Molecular Biology 162(3), 705–708 (1982)
- 16. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)
- 17. Miki, B.L., Neelin, J.M.: DNA repeat lengths of erythrocyte chromatins differing in content of histones H1 and H5. Nucleic Acids Res. 8(3), 529–542 (1980)
- 18. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov