Abstract
In this study, we present an online suffix tree construction approach in which multiple sequences are indexed by a single suffix tree. Because of poor memory locality and high space consumption, online suffix tree construction on disk is a challenging process. Moreover, construction performance suffers when the alphabet size is large. To overcome these difficulties, we first present a space-efficient node representation for use in Ukkonen's suffix tree construction algorithm. Next, we show that performance can be increased by incorporating semantic knowledge, such as the frequently used letters of an alphabet. In particular, we estimate the frequently accessed nodes of the tree and introduce a strategy for inserting sequences into the tree. As a result, access to the frequently accessed nodes becomes faster. Finally, we analyze the effect of buffering strategies and page sizes on performance through a series of experiments. The experimental results show that our approach outperforms existing ones.
Keywords
- Suffix trees
- sequence databases
- time series indexing
- poor memory locality
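The abstract's point about buffering strategies and page sizes can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes tree nodes are numbered consecutively and packed `nodes_per_page` to a disk page, and it uses least-recently-used (LRU) eviction as one example of a buffering strategy. The names `LRUPageBuffer`, `count_faults`, `capacity_pages`, and `nodes_per_page` are hypothetical and chosen only for this illustration.

```python
from collections import OrderedDict


class LRUPageBuffer:
    """Fixed-capacity page buffer with least-recently-used eviction (illustrative only)."""

    def __init__(self, capacity_pages, nodes_per_page):
        self.capacity_pages = capacity_pages
        self.nodes_per_page = nodes_per_page
        # Keys are page ids, ordered from least to most recently used.
        self.pages = OrderedDict()
        self.hits = 0
        self.faults = 0

    def access(self, node_id):
        # Assumption: node ids map to pages by integer division.
        page_id = node_id // self.nodes_per_page
        if page_id in self.pages:
            # Page is buffered: mark it as most recently used.
            self.pages.move_to_end(page_id)
            self.hits += 1
        else:
            # Page fault: evict the least recently used page if the buffer is full.
            self.faults += 1
            if len(self.pages) >= self.capacity_pages:
                self.pages.popitem(last=False)
            self.pages[page_id] = None


def count_faults(trace, capacity_pages, nodes_per_page):
    """Replay a node-access trace and return the number of page faults."""
    buf = LRUPageBuffer(capacity_pages, nodes_per_page)
    for node_id in trace:
        buf.access(node_id)
    return buf.faults


if __name__ == "__main__":
    # A made-up access trace that revisits low-numbered (frequently used) nodes.
    trace = [0, 1, 2, 3, 0, 1, 8, 9, 0]
    for nodes_per_page in (1, 2, 4):
        print(nodes_per_page, count_faults(trace, capacity_pages=2, nodes_per_page=nodes_per_page))
```

In this toy trace, larger pages reduce the fault count because the frequently revisited nodes share a page. That is the kind of locality effect the abstract attributes to placing frequently accessed nodes together, although the paper's actual layout and buffering policies may differ.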
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ozcan, G., Alpkocak, A. (2008). Online Suffix Tree Construction for Streaming Sequences. In: Sarbazi-Azad, H., Parhami, B., Miremadi, SG., Hessabi, S. (eds) Advances in Computer Science and Engineering. CSICC 2008. Communications in Computer and Information Science, vol 6. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89985-3_9
DOI: https://doi.org/10.1007/978-3-540-89985-3_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89984-6
Online ISBN: 978-3-540-89985-3
eBook Packages: Computer Science, Computer Science (R0)