One-dimensional and multi-dimensional substring selectivity estimation

  • Regular contribution
The VLDB Journal

Abstract.

With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms.
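
To make the idea of a pruned count structure concrete, here is a minimal sketch in Python. It is not the paper's PST construction or the MO estimator: it uses a plain dictionary of substring counts in place of a suffix tree, counts a substring once per string that contains it, and prunes with a simple count threshold. The names `build_pruned_counts`, `selectivity`, and `min_count` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def build_pruned_counts(strings, min_count):
    """Count, for every substring, how many input strings contain it,
    then keep only substrings whose count reaches min_count.

    Assumption: a dictionary stands in for the count-suffix tree, and
    pruning is a plain threshold on the count.
    """
    counts = Counter()
    for s in strings:
        # A set ensures each substring is counted once per string.
        seen = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        counts.update(seen)
    return {sub: c for sub, c in counts.items() if c >= min_count}


def selectivity(pruned, substring, total):
    """Return the exact selectivity if the substring survived pruning,
    or None if it was pruned and therefore has to be estimated."""
    count = pruned.get(substring)
    return None if count is None else count / total


if __name__ == "__main__":
    data = ["abc", "abd", "bcd"]
    pruned = build_pruned_counts(data, min_count=2)
    print(selectivity(pruned, "ab", len(data)))
    print(selectivity(pruned, "abc", len(data)))
```

A query that survives pruning can be answered directly from its stored count. A query that was pruned away returns `None` in this sketch, and that is the case an estimation technique such as MO is designed to handle, by combining the counts of the substrings that remain.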

Additional information

Received April 28, 2000 / Accepted July 11, 2000

About this article

Cite this article

Jagadish, H., Kapitskaia, O., Ng, R. et al. One-dimensional and multi-dimensional substring selectivity estimation. The VLDB Journal 9, 214–230 (2000). https://doi.org/10.1007/s007780000029

  • DOI: https://doi.org/10.1007/s007780000029
