Abstract.
With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms.
Similar content being viewed by others
Author information
Authors and Affiliations
Additional information
Received April 28, 2000 / Accepted July 11, 2000
Rights and permissions
About this article
Cite this article
Jagadish, H., Kapitskaia, O., Ng, R. et al. One-dimensional and multi-dimensional substring selectivity estimation. The VLDB Journal 9, 214–230 (2000). https://doi.org/10.1007/s007780000029
Issue Date:
DOI: https://doi.org/10.1007/s007780000029