Estimating the Number of Substring Matches in Long String Databases

Bae, Jinuk; Lee, Sukho

doi:10.1007/978-3-540-31849-1_15


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3399)

Included in the following conference series:

Asia-Pacific Web Conference


Abstract

Estimating the number of substring matches is one instance of alphanumeric selectivity estimation, which relies on statistical information about strings. In this context, the CS-tree (Count Suffix Tree), a variation of the suffix tree, has served as the basic data structure for storing statistical information about substrings. Although the CS-tree is useful for keeping information about short strings such as names or titles, it has two drawbacks: some of the count values it keeps can be incorrect, and it is almost impossible to build over long strings such as biological sequences.

Therefore, to estimate the number of substring matches in long strings, we propose the CQ-tree (Count Q-gram Tree), which keeps the exact counts of all substrings of length q or less that occur in the long strings and can be constructed in a single scan of the data strings.
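As an illustration of the idea of keeping exact counts for every substring of length at most q, the following is a minimal Python sketch of a count trie built in one left-to-right pass. It is not the authors' CQ-tree construction, whose details the abstract does not give; the names `Node`, `build_count_trie`, and `exact_count` are hypothetical, and the sketch assumes a single in-memory string.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One trie node; `count` is the number of occurrences of the path label."""
    count: int = 0
    children: dict = field(default_factory=dict)


def build_count_trie(text: str, q: int) -> Node:
    """Insert the (up to) q characters starting at every position of `text`.

    Each position is visited once, and every prefix of the window that
    starts there has its counter incremented, so after the pass each node
    at depth <= q holds the exact occurrence count of its substring.
    """
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:start + q]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def exact_count(root: Node, pattern: str, q: int) -> int:
    """Return the exact count of `pattern`, which must have length 1..q."""
    if not 1 <= len(pattern) <= q:
        raise ValueError("exact counts are stored only for lengths 1..q")
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0  # the pattern never occurs in the text
    return node.count


trie = build_count_trie("ACGTACGA", 3)
print(exact_count(trie, "ACG", 3))
```

The nested loop costs on the order of q steps per input position, which is why the structure can be filled in one scan; patterns longer than q are deliberately rejected because their counts are not stored.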

Furthermore, on the basis of the CQ-tree, we return lower and upper bounds on the number of occurrences of a query pattern, together with the estimated count of that pattern. These bounds are mathematically proved. To the best of our knowledge, our work is the first in alphanumeric selectivity estimation to present such lower and upper bounds.
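The abstract does not state the bounds themselves, so the following Python sketch only illustrates how bounds of this kind can be derived from exact counts of substrings of length at most q. It assumes a simple inclusion-exclusion argument for the lower bound, the minimum q-gram count for the upper bound, and a Markov-style overlap estimate clamped between the two; these choices, and the names `substring_counts` and `bounds_and_estimate`, are assumptions for illustration and not the paper's formulas.

```python
from collections import Counter


def substring_counts(text: str, q: int) -> Counter:
    """Exact counts of all substrings of length 1..q, gathered in one pass."""
    counts = Counter()
    for start in range(len(text)):
        for end in range(start + 1, min(start + q, len(text)) + 1):
            counts[text[start:end]] += 1
    return counts


def bounds_and_estimate(counts: Counter, query: str, q: int):
    """Return (lower, estimate, upper) for the occurrence count of `query`.

    Assumed reasoning: extending a prefix P by one character c, with w the
    q-gram ending at c and m the (q-1)-gram that w shares with P,
        count(P + c) >= count(P) + count(w) - count(m)   (inclusion-exclusion)
        count(P + c) <= min(count(P), count(w))
    """
    if q < 2:
        raise ValueError("this sketch needs q >= 2")
    if len(query) <= q:
        exact = counts[query]  # stored exactly, so the bounds coincide
        return exact, float(exact), exact
    lower = upper = counts[query[:q]]
    estimate = float(lower)
    for end in range(q + 1, len(query) + 1):
        window = query[end - q:end]   # q-gram ending at the new character
        overlap = window[:-1]         # (q-1)-gram shared with the prefix
        c_window, c_overlap = counts[window], counts[overlap]
        lower = max(0, lower + c_window - c_overlap)
        upper = min(upper, c_window)
        # Markov-style step: scale by the conditional frequency of the new char.
        estimate = estimate * c_window / c_overlap if c_overlap else 0.0
    estimate = min(max(estimate, lower), upper)  # keep estimate inside bounds
    return lower, estimate, upper


counts = substring_counts("ACGTACGA", 3)
print(bounds_and_estimate(counts, "ACGT", 3))
```

Because both bounds are computed only from stored counts, the true count of any query always lies between them under these assumptions, whatever the quality of the estimate.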

This work was supported in part by the Brain Korea 21 Project and in part by the Ministry of Information & Communications, Korea, under the Information Technology Research Center (ITRC) Support Program in 2004.



Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, Seoul National University, Korea
Jinuk Bae & Sukho Lee


Editor information

Editors and Affiliations

Victoria University, Australia
Yanchun Zhang
University of Kyoto, Japan
Katsumi Tanaka
Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, MOE, 100872, Beijing, P.R. China
Shan Wang
Department of Computer Science and Engineering, Shanghai Jiao Tong University, 80 Dongchuan Road, 200240, Shanghai, China
Minglu Li


About this paper

Cite this paper

Bae, J., Lee, S. (2005). Estimating the Number of Substring Matches in Long String Databases. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_15


DOI: https://doi.org/10.1007/978-3-540-31849-1_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25207-8
Online ISBN: 978-3-540-31849-1
eBook Packages: Computer Science, Computer Science (R0)
