Abstract
Designing external memory data structures for string databases is of significant recent interest due to the proliferation of biological sequence data. The suffix tree is an important indexing structure that provides optimal algorithms for memory bound data. However, string B-trees provide the best known asymptotic performance in external memory for substring search and update operations. Work on external memory variants of suffix trees has largely focused on constructing suffix trees in external memory or layout schemes for suffix trees that preserve link locality. In this paper, we present a new suffix tree layout scheme for secondary storage and present construction, substring search, insertion and deletion algorithms that are competitive with the string B-tree. For a set of strings of total length n, a pattern p and disk blocks of size B, we provide a substring search algorithm that uses O(|p|/B + log_B n) disk accesses. We present algorithms for insertion and deletion of all suffixes of a string of length m that take O(m log_B (n+m)) and O(m log_B n) disk accesses, respectively. Our results demonstrate that suffix trees can be directly used as efficient secondary storage data structures for string and sequence data.
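To give a feel for the bounds stated in the abstract, the following minimal Python sketch evaluates the three expressions with all constant factors dropped. The function names and the sample parameter values are illustrative assumptions and do not come from the paper; the results are growth-rate estimates, not predicted I/O counts for any implementation.

```python
import math


def search_bound(p_len: int, n: int, B: int) -> float:
    """Substring search bound |p|/B + log_B n (constants dropped)."""
    return p_len / B + math.log(n, B)


def insert_bound(m: int, n: int, B: int) -> float:
    """Bound m * log_B(n + m) for inserting all suffixes of a length-m string."""
    return m * math.log(n + m, B)


def delete_bound(m: int, n: int, B: int) -> float:
    """Bound m * log_B n for deleting all suffixes of a length-m string."""
    return m * math.log(n, B)


if __name__ == "__main__":
    # Hypothetical sizes: 2^30 indexed characters, blocks holding 2^10 items.
    n, B = 1024 ** 3, 1024
    print("search, |p| = 1024:", search_bound(1024, n, B))
    print("insert, m = 1000:  ", insert_bound(1000, n, B))
    print("delete, m = 1000:  ", delete_bound(1000, n, B))
```

With these assumed sizes the logarithmic term is about 3, so for a long pattern the |p|/B term, which is the cost of reading the pattern's worth of blocks, dominates the search bound.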
Research supported by the National Science Foundation under IIS-0430853.
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ko, P., Aluru, S. (2006). Obtaining Provably Good Performance from Suffix Trees in Secondary Storage. In: Lewenstein, M., Valiente, G. (eds) Combinatorial Pattern Matching. CPM 2006. Lecture Notes in Computer Science, vol 4009. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780441_8
DOI: https://doi.org/10.1007/11780441_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35455-0
Online ISBN: 978-3-540-35461-1