Indexing Factors in DNA/RNA Sequences

  • Tomáš Flouri
  • Costas Iliopoulos
  • M. Sohel Rahman
  • Ladislav Vagner
  • Michal Voráček
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 13)

Abstract

In this paper, we introduce the Truncated Generalized Suffix Automaton (TGSA) and give an efficient on-line algorithm for its construction. TGSA is a novel type of finite automaton suitable for indexing DNA and RNA sequences in which the text is degenerate, i.e., contains sets of characters. TGSA indexes the so-called k-factors: the factors of the degenerate text whose length does not exceed a given constant k. The presented algorithm runs in \(\mathcal{O}(n^2)\) time, where n is the length of the input DNA/RNA sequence. The resulting TGSA has at most a linear number of states with respect to the length of the text. TGSA enables us to find the list occ(u) of all occurrences of a given pattern u in the degenerate text in time |u| + |occ(u)|.
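To make the notions of degenerate text, k-factor, and occ(u) concrete, the following is a minimal Python sketch of a brute-force k-factor index. It is not the TGSA and not the paper's construction algorithm: it simply enumerates every factor of length at most k and stores the factors in a dictionary. The function name `k_factor_index`, the representation of the text as a list of character sets, and the choice to report occurrences as 0-based start positions are all assumptions made for illustration.

```python
from itertools import product


def k_factor_index(text, k):
    """Map every factor of length 1..k to the list of its start positions.

    `text` is a degenerate string given as a list of sets of characters
    (an assumed representation). A string u occurs at position i when
    u[j] is a member of text[i + j] for every j.
    """
    index = {}
    n = len(text)
    for start in range(n):
        for length in range(1, min(k, n - start) + 1):
            # Sort each set so the enumeration order is deterministic.
            window = [sorted(chars) for chars in text[start:start + length]]
            for letters in product(*window):
                index.setdefault("".join(letters), []).append(start)
    return index


# Degenerate DNA text of length 4: position 1 is either C or G.
text = [{"A"}, {"C", "G"}, {"A"}, {"C"}]
index = k_factor_index(text, k=2)

# occ(u) for the pattern u = "AC" (hypothetical query, 0-based starts).
occ = index.get("AC", [])
```

Unlike the automaton described in the abstract, this dictionary can grow exponentially in k when many positions are degenerate, so it serves only to illustrate what is being indexed, not how to index it efficiently.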

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Tomáš Flouri (1)
  • Costas Iliopoulos (2)
  • M. Sohel Rahman (2)
  • Ladislav Vagner (1)
  • Michal Voráček (1)
  1. Department of Computer Science & Engineering, Czech Technical University in Prague, Czech Republic
  2. Algorithm Design Group, Department of Computer Science, King’s College London, Strand, London, England
