LATA 2010: Language and Automata Theory and Applications, pp. 512–523
A Randomized Numerical Aligner (rNA)
Abstract
With the advent of new sequencing technologies able to produce an enormous quantity of short genomic sequences, new tools able to search for them inside a reference genome have emerged. Because of chemical reading errors or of the variability between organisms, one is interested in finding not only exact occurrences, but also occurrences with up to k mismatches. The contribution of this paper is twofold. On the one hand, we present a generalization of the classical Rabin-Karp string matching algorithm to solve the k-mismatch problem, with average complexity \(\mathcal{O}(n+m)\). On the other hand, we show how to employ this idea in conjunction with an index over the text, allowing a pattern to be searched, with up to k mismatches, in time proportional to its length. The resulting tool, rNA (randomized Numerical Aligner), outperforms available tools like SOAP2, BWA, and BOWTIE, processing up to 10 times more patterns per second on texts of (practically) significant lengths.
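To make the idea of a Rabin-Karp style search with up to k mismatches concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the modulus `Q`, and the DNA encoding in `CODE` are all hypothetical choices. It relies on the fact that substituting one character changes a polynomial fingerprint by a predictable amount, so the fingerprint differences reachable with at most k substitutions can be precomputed and used as a filter, with every candidate then verified by a direct Hamming-distance check.

```python
from itertools import combinations, product

SIGMA = 4                                   # DNA alphabet size
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}     # assumed encoding, for illustration
Q = (1 << 61) - 1                           # assumed modulus (a Mersenne prime)


def fingerprint(s):
    """Polynomial fingerprint of s in base SIGMA, modulo Q."""
    h = 0
    for ch in s:
        h = (h * SIGMA + CODE[ch]) % Q
    return h


def mismatch_offsets(m, k):
    """Fingerprint differences caused by at most k substitutions in a
    string of length m. The set grows quickly with k, so this naive
    enumeration is only practical for small k."""
    deltas = [d for d in range(-(SIGMA - 1), SIGMA) if d != 0]
    offsets = {0}
    for j in range(1, k + 1):
        for positions in combinations(range(m), j):
            weights = [pow(SIGMA, m - 1 - p, Q) for p in positions]
            for ds in product(deltas, repeat=j):
                offsets.add(sum(d * w for d, w in zip(ds, weights)) % Q)
    return offsets


def k_mismatch_search(text, pattern, k):
    """Return start positions where pattern occurs in text with <= k mismatches."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    offsets = mismatch_offsets(m, k)
    fp = fingerprint(pattern)
    fw = fingerprint(text[:m])              # fingerprint of the current window
    top = pow(SIGMA, m - 1, Q)              # weight of the window's first character
    hits = []
    for i in range(n - m + 1):
        # Filter: the difference must be reachable with at most k substitutions.
        if (fw - fp) % Q in offsets:
            # Verify, since distinct windows can share a fingerprint difference.
            if sum(a != b for a, b in zip(text[i:i + m], pattern)) <= k:
                hits.append(i)
        if i + m < n:
            # Roll the window one position to the right.
            fw = ((fw - CODE[text[i]] * top) * SIGMA + CODE[text[i + m]]) % Q
    return hits


if __name__ == "__main__":
    print(k_mismatch_search("ACGTACGTTACG", "ACGA", 1))
```

Because each candidate is verified explicitly, the sketch never reports a false occurrence; the fingerprint test only decides which windows are worth comparing character by character.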
Keywords
String Match Reference Sequence Genome Processor Word Residue Number System Average Complexity