SEPIA: estimating selectivities of approximate string predicates in large Databases

  • Regular Paper
The VLDB Journal

Abstract

Many database applications have the emerging need to support approximate queries that ask for strings that are similar to a given string, such as “name similar to smith” and “telephone number similar to 412-0964”. Query optimization needs the selectivity of such an approximate predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of approximate string predicates. We develop a novel technique, called Sepia, to solve the problem. Given a bag of strings, our technique groups the strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance metric. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of approximate string predicates.
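
To make the quantities in the abstract concrete, the following minimal sketch computes the exact selectivity of an approximate string predicate by brute force, using unit-cost edit distance. It is the baseline that an estimator such as Sepia tries to approximate without scanning every string; it is not the Sepia algorithm itself. The function names `edit_distance`, `exact_selectivity`, and `distance_bounds` are illustrative assumptions and do not come from the authors' source code. `distance_bounds` shows only the triangle-inequality range that the distances between q and p and between p and s impose on the distance between q and s; the paper's histograms describe a probability distribution within such a range.

```python
from typing import Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_selectivity(strings: Sequence[str], query: str, threshold: int) -> float:
    """Fraction of strings within `threshold` edits of `query` (full scan)."""
    if not strings:
        return 0.0
    matches = sum(1 for s in strings if edit_distance(query, s) <= threshold)
    return matches / len(strings)


def distance_bounds(d_qp: int, d_ps: int) -> Tuple[int, int]:
    """Triangle-inequality bounds on d(q, s), given d(q, p) and d(p, s)."""
    return abs(d_qp - d_ps), d_qp + d_ps


if __name__ == "__main__":
    names = ["smith", "smyth", "smithe", "jones"]
    print(exact_selectivity(names, "smith", 1))
    print(distance_bounds(edit_distance("smith", "smyth"),
                          edit_distance("smyth", "smythe")))
```

The full scan costs one edit-distance computation per stored string, which is what makes a compact histogram-based estimate attractive inside a query optimizer.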

References

  1. Aboulnaga, A., Alameldeen, A.R., Naughton, J.F.: Estimating the selectivity of XML path expressions for internet scale applications. VLDB (2001)

  2. Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. VLDB (2002)

  3. Bustos, B., Navarro, G., Chávez, E.: Pivot selection techniques for proximity searching in metric spaces. In: Proceedings of the XXI Conference of the Chilean Computer Science Society (SCCC’01) (2001)

  4. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD (2003)

  5. Chaudhuri, S., Ganti, V., Gravano, L.: Selectivity estimation for string predicates: overcoming the underestimation problem. In: International Conference on Data Engineering (2004)

  6. Chávez E., Navarro G., Baeza-Yates R.A., Marroquín J.L: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001)

  7. Chen, Z., Korn, F., Koudas, N., Muthukrishnan, S.: Selectivity estimation for boolean queries. In: Symposium on Principles of Database Systems, pp. 216–225 (2000)

  8. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. VLDB (1997)

  9. Cohen, W., Ravikumar, P., Fienberg, S.: A Comparison of String Metrics for Matching Names and Records. Data Cleaning Workshop in Conjunction with KDD (2003)

  10. Filho, R.F.S., Traina, A.J.M., Traina, C. Jr., Faloutsos, C.: Similarity search without tears: The OMNI family of all-purpose access methods. ICDE 623–630 (2001)

  11. Gower J.C., Legendre P.: Metric and euclidean properties of dissimilarity coefficients. J. Class. 3(1), 5–48 (1986)

  12. Gravano, L., Ipeirotis, P., Koudas, N., Srivastava, D.: Approximate text joins and their integration into an RDBMS. In: Proceedings of WWW (2002)

  13. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. VLDB 491–500 (2001)

  14. Hartigan, J.A., Wong, M.A.: A k-means clustering algorithm. Appl. Stat. 100–108 (1979)

  15. Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Carey, M.J., Schneider D.A. (eds.) SIGMOD, pp. 127–138 (1995)

  16. Hjaltason G.R., Samet H.: Index-driven similarity search in metric spaces. ACM Trans. Database Syst. 28(4), 517–580 (2003)

  17. Jagadish, H.V., Kapitskaia, O., Ng, R.T., Srivastava, D.: Multi-dimensional substring selectivity estimation. VLDB (1999)

  18. Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal histograms with quality guarantees. VLDB 275–286 (1998)

  19. Jin, L., Koudas, N., Li, C.: NNH: improving performance of nearest-neighbor searches using histograms. In: EDBT (2004)

  20. Jin, L., Koudas, N., Li, C., Tung, A.K.: Indexing mixed types for approximate retrieval. In: Proceedings of VLDB (2005)

  21. Jin, L., Li, C.: Selectivity estimation for fuzzy string predicates in large data sets. In: VLDB 397–408 (2005)

  22. Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Eighth International Conference on Database Systems for Advanced Applications (2003)

  23. Traina, C. Jr., Traina, A.J., Faloutsos, C.: Distance exponent: A new concept for selectivity estimation in metric trees. In: ICDE (2000)

  24. Kaufman L., Rousseeuw P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)

  25. Kooi, R.P.: The optimization of queries in relational databases. Ph.D thesis, Case Western Reserve University (1980)

  26. Krishnan P., Vitter J.S., Iyer B.R.: Estimating alphanumeric selectivity in the presence of wildcards. In: SIGMOD, pp. 282–293. ACM Press, New York (1996)

  27. Lee, M.L., Ling, T.W., Low, W.L.: Intelliclean: a knowledge-based intelligent data cleaner. In: Knowledge Discovery and Data Mining, pp. 290–294 (2000)

  28. Levenshtein V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Dokl. 10(8), 707–710 (1966) (Original in Russian in Doklady Akademii Nauk SSSR 163, 4, 845–848, 1965)

  29. Lim, L., Wang, M., Padmanabhan, S., Vitter, J., Parr, R.: XPathLearner: an on-line self-tuning Markov histogram for XML path selectivity estimation. In: 28th International Conference on Very Large Data Bases (2002)

  30. Liu, Z., Luo, C., Cho, J., Chu, W.: A probabilistic approach to metasearching with adaptive probing. ICDE (2004)

  31. Matias, Y., Vitter, J.S., Wang, M.: Dynamic maintenance of wavelet-based histograms. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 101–111. Cairo, Egypt (2000)

  32. Muralikrishna, M., DeWitt, D.J.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD (1988)

  33. Navarro G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)

  34. Poosala, V., Haas, P.J., Ioannidis, Y.E., Shekita, E.J.: Improved histograms for selectivity estimation of range predicates. SIGMOD 294–305 (1996)

  35. Sahinalp, S.C., Tasan, M., Macker, J., Ozsoyoglu, Z.M.: Distance based indexing for string proximity search. In: International Conference on Data Engineering (2003)

  36. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of VLDB (2002)

  37. Smith T.F., Waterman M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)

  38. Standard Template Library (STL). http://www.sgi.com/tech/stl/

  39. Tao, Y., Faloutsos, C., Papadias, D.: The power-method: a comprehensive estimation technique for multi-dimensional queries. In: CIKM (2003)

  40. Ukkonen E.: Algorithms for approximate string matching. Inform. Control 64(1–3), 100–118 (1985)

  41. Vleugels, J., Veltkamp, R.C.: Efficient image retrieval through vantage objects. Visual Inform. Inform. Syst. 575–584 (1999)

Author information

Corresponding author

Correspondence to Rares Vernica.

Additional information

A short version of this article appeared as [21] in the proceedings of the 31st International Conference on Very Large Data Bases (VLDB), August 30 – September 2, 2005, Trondheim, Norway. The source code of our algorithms is available at http://flamingo.ics.uci.edu/.

About this article

Cite this article

Jin, L., Li, C. & Vernica, R. SEPIA: estimating selectivities of approximate string predicates in large Databases. The VLDB Journal 17, 1213–1229 (2008). https://doi.org/10.1007/s00778-007-0061-2

  • DOI: https://doi.org/10.1007/s00778-007-0061-2