Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

  • Pang Ko
  • Srinivas Aluru
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4009)

Abstract

Designing external memory data structures for string databases is of significant recent interest due to the proliferation of biological sequence data. The suffix tree is an important indexing structure that provides optimal algorithms for memory-bound data. However, string B-trees provide the best known asymptotic performance in external memory for substring search and update operations. Work on external memory variants of suffix trees has largely focused on constructing suffix trees in external memory or on layout schemes for suffix trees that preserve link locality. In this paper, we present a new suffix tree layout scheme for secondary storage, together with construction, substring search, insertion and deletion algorithms that are competitive with the string B-tree. For a set of strings of total length $n$, a pattern $p$ and disk blocks of size $B$, we provide a substring search algorithm that uses $O(|p|/B + \log_B n)$ disk accesses. We present algorithms for insertion and deletion of all suffixes of a string of length $m$ that take $O(m \log_B (n+m))$ and $O(m \log_B n)$ disk accesses, respectively. Our results demonstrate that suffix trees can be directly used as efficient secondary storage data structures for string and sequence data.
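To get a feel for the bounds stated in the abstract, the following is a minimal sketch that evaluates the three expressions with all hidden constants ignored. It is not the authors' algorithm and does not build a suffix tree; the function names and the sample values for n, B, |p| and m are assumptions chosen purely for illustration.

```python
import math


def _log_base(x: float, base: float) -> float:
    """Logarithm of x in the given base, computed via log2."""
    if base <= 1 or x < 1:
        raise ValueError("need base > 1 and x >= 1")
    return math.log2(x) / math.log2(base)


def search_io_bound(pattern_len: int, n: int, block_size: int) -> float:
    """Evaluate |p|/B + log_B n (constants dropped)."""
    return pattern_len / block_size + _log_base(n, block_size)


def insert_io_bound(m: int, n: int, block_size: int) -> float:
    """Evaluate m * log_B (n + m) (constants dropped)."""
    return m * _log_base(n + m, block_size)


def delete_io_bound(m: int, n: int, block_size: int) -> float:
    """Evaluate m * log_B n (constants dropped)."""
    return m * _log_base(n, block_size)


if __name__ == "__main__":
    # Hypothetical sizes: about 10^9 characters indexed, 1024-character blocks.
    n, block_size = 2**30, 1024
    print("search:", search_io_bound(1024, n, block_size))
    print("insert:", insert_io_bound(1000, n, block_size))
    print("delete:", delete_io_bound(1000, n, block_size))
```

With these assumed values the search expression evaluates to 4: one block's worth of pattern plus a logarithmic term of 3. The point of the sketch is only that the logarithmic term grows very slowly in n when B is large, which is why the bound is attractive for disk-resident indexes.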

Keywords

Internal Node, Tree Construction, External Memory, Edge Label, Suffix Tree
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Pang Ko (1)
  • Srinivas Aluru (2)
  1. Department of Electrical and Computer Engineering
  2. Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University
