SEPIA: estimating selectivities of approximate string predicates in large Databases

Jin, Liang; Li, Chen; Vernica, Rares

doi:10.1007/s00778-007-0061-2

Regular Paper
Published: 30 May 2008

The VLDB Journal, Volume 17, pages 1213–1229 (2008)

Abstract

Many database applications increasingly need to support approximate queries that ask for strings similar to a given string, such as “name similar to smith” and “telephone number similar to 412-0964”. Query optimization needs the selectivity of such an approximate predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of approximate string predicates and develop a novel technique, called Sepia, to solve it. Given a bag of strings, our technique groups the strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the same cluster, the proximity between q and p together with the proximity between p and s lets us obtain, from the global histogram, a probability distribution for the similarity between q and s. We give a full specification of the technique using the edit distance metric. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them for selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We also discuss how to extend the technique to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of approximate string predicates.
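
To make the estimated quantity concrete, the following minimal Python sketch computes the exact selectivity of an edit-distance predicate by scanning every string. It is not the Sepia estimator and is not taken from the paper: the function names `edit_distance` and `exact_selectivity` and the sample data are illustrative assumptions. A full scan like this is the baseline that a histogram-based estimate is meant to approximate without touching all records.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program (two rows)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_selectivity(strings, query, threshold):
    """Fraction of strings within `threshold` edits of `query` (full scan)."""
    if not strings:
        return 0.0
    hits = sum(1 for s in strings if edit_distance(query, s) <= threshold)
    return hits / len(strings)


# Hypothetical sample data, chosen only for illustration.
names = ["smith", "smyth", "smithe", "jones", "schmidt"]
print(exact_selectivity(names, "smith", 1))  # 0.6
```

Here three of the five names ("smith", "smyth", "smithe") lie within one edit of the query, so the predicate "name similar to smith" with threshold 1 has selectivity 3/5.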

References

Aboulnaga, A., Alameldeen, A.R., Naughton, J.F.: Estimating the selectivity of XML path expressions for internet scale applications. VLDB (2001)
Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. VLDB (2002)
Bustos, B., Navarro, G., Chávez, E.: Pivot selection techniques for proximity searching in metric spaces. In: Proceedings of the XXI Conference of the Chilean Computer Science Society (SCCC’01) (2001)
Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD (2003)
Chaudhuri, S., Ganti, V., Gravano, L.: Selectivity estimation for string predicates: overcoming the underestimation problem. In: International Conference on Data Engineering (2004)
Chávez, E., Navarro, G., Baeza-Yates, R.A., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001)
Chen, Z., Korn, F., Koudas, N., Muthukrishnan, S.: Selectivity estimation for boolean queries. In: Symposium on Principles of Database Systems, pp. 216–225 (2000)
Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. VLDB (1997)
Cohen, W., Ravikumar, P., Fienberg, S.: A Comparison of String Metrics for Matching Names and Records. Data Cleaning Workshop in Conjunction with KDD (2003)
Filho, R.F.S., Traina, A.J.M., Traina, C. Jr., Faloutsos, C.: Similarity search without tears: The OMNI family of all-purpose access methods. ICDE 623–630 (2001)
Gower, J.C., Legendre, P.: Metric and euclidean properties of dissimilarity coefficients. J. Class. 3(1), 5–48 (1986)
Gravano, L., Ipeirotis, P., Koudas, N., Srivastava, D.: Approximate text joins and their integration into an RDBMS. In: Proceedings of WWW (2002)
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. VLDB 491–500 (2001)
Hartigan, J.A., Wong, M.A.: A k-means clustering algorithm. Appl. Stat. 100–108 (1979)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Carey, M.J., Schneider, D.A. (eds.) SIGMOD, pp. 127–138 (1995)
Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces. ACM Trans. Database Syst. 28(4), 517–580 (2003)
Jagadish, H.V., Kapitskaia, O., Ng, R.T., Srivastava, D.: Multi-dimensional substring selectivity estimation. VLDB (1999)
Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal histograms with quality guarantees. VLDB 275–286 (1998)
Jin, L., Koudas, N., Li, C.: NNH: improving performance of nearest-neighbor searches using histograms. In: EDBT (2004)
Jin, L., Koudas, N., Li, C., Tung, A.K.: Indexing mixed types for approximate retrieval. In: Proceedings of VLDB (2005)
Jin, L., Li, C.: Selectivity estimation for fuzzy string predicates in large data sets. In: VLDB 397–408 (2005)
Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Eighth International Conference on Database Systems for Advanced Applications (2003)
Traina, C. Jr., Traina, A.J., Faloutsos, C.: Distance exponent: A new concept for selectivity estimation in metric trees. In: ICDE (2000)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Kooi, R.P.: The optimization of queries in relational databases. Ph.D. thesis, Case Western Reserve University (1980)
Krishnan, P., Vitter, J.S., Iyer, B.R.: Estimating alphanumeric selectivity in the presence of wildcards. In: SIGMOD, pp. 282–293. ACM Press, New York (1996)
Lee, M.L., Ling, T.W., Low, W.L.: IntelliClean: a knowledge-based intelligent data cleaner. In: Knowledge Discovery and Data Mining, pp. 290–294 (2000)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Dokl. 10(8), 707–710 (1966) (Original in Russian in Doklady Akademii Nauk SSSR 163, 4, 845–848, 1965)
Lim, L., Wang, M., Padmanabhan, S., Vitter, J., Parr, R.: XPathLearner: an on-line self-tuning Markov histogram for XML path selectivity estimation. In: 28th International Conference on Very Large Data Bases (2002)
Liu, Z., Luo, C., Cho, J., Chu, W.: A probabilistic approach to metasearching with adaptive probing. ICDE (2004)
Matias, Y., Vitter, J.S., Wang, M.: Dynamic maintenance of wavelet-based histograms. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 101–111. Cairo, Egypt (2000)
Muralikrishna, M., DeWitt, D.J.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD (1988)
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)
Poosala, V., Haas, P.J., Ioannidis, Y.E., Shekita, E.J.: Improved histograms for selectivity estimation of range predicates. SIGMOD 294–305 (1996)
Sahinalp, S.C., Tasan, M., Macker, J., Ozsoyoglu, Z.M.: Distance based indexing for string proximity search. In: International Conference on Data Engineering (2003)
Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of VLDB (2002)
Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)
Standard Template Library (STL). http://www.sgi.com/tech/stl/
Tao, Y., Faloutsos, C., Papadias, D.: The power-method: a comprehensive estimation technique for multi-dimensional queries. In: CIKM (2003)
Ukkonen, E.: Algorithms for approximate string matching. Inform. Control 64(1–3), 100–118 (1985)
Vleugels, J., Veltkamp, R.C.: Efficient image retrieval through vantage objects. Visual Inform. Inform. Syst. 575–584 (1999)

Author information

Authors and Affiliations

University of California, Irvine, USA
Liang Jin, Chen Li & Rares Vernica

Corresponding author

Correspondence to Rares Vernica.

Additional information

A short version of this article appeared as [21] in the proceedings of the 31st International Conference on Very Large Data Bases (VLDB), August 30 – September 2, 2005, Trondheim, Norway. The source code of our algorithms is available at http://flamingo.ics.uci.edu/.

About this article

Cite this article

Jin, L., Li, C. & Vernica, R. SEPIA: estimating selectivities of approximate string predicates in large Databases. The VLDB Journal 17, 1213–1229 (2008). https://doi.org/10.1007/s00778-007-0061-2

Received: 08 August 2006
Revised: 08 May 2007
Accepted: 14 May 2007
Published: 30 May 2008
Issue Date: August 2008
DOI: https://doi.org/10.1007/s00778-007-0061-2
