Abstract
Many database applications have the emerging need to support approximate queries that ask for strings that are similar to a given string, such as “name similar to smith” and “telephone number similar to 412-0964”. Query optimization needs the selectivity of such an approximate predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of approximate string predicates. We develop a novel technique, called Sepia, to solve the problem. Given a bag of strings, our technique groups the strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance metric. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of approximate string predicates.
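As a point of reference for the quantity being estimated, the following minimal Python sketch computes the exact selectivity of an edit-distance predicate by scanning every string in the bag. It is not the Sepia technique itself, which avoids such a scan by consulting cluster and global histograms. The function names `edit_distance` and `exact_selectivity` and the threshold parameter `k` are illustrative assumptions, not names taken from the paper.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def exact_selectivity(strings: Iterable[str], query: str, k: int) -> float:
    """Fraction of strings within edit distance k of the query.

    Brute-force baseline (assumption: the predicate is
    edit_distance(query, s) <= k); an estimator approximates this value.
    """
    strings = list(strings)
    if not strings:
        return 0.0
    hits = sum(1 for s in strings if edit_distance(query, s) <= k)
    return hits / len(strings)
```

For example, with the bag `["smith", "smyth", "smithe", "jones"]`, the query `"smith"`, and `k = 1`, three of the four strings qualify, so the exact selectivity is 0.75. An estimator would be judged by how closely it approximates this value without the full scan.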
References
Aboulnaga, A., Alameldeen, A.R., Naughton, J.F.: Estimating the selectivity of XML path expressions for internet scale applications. VLDB (2001)
Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. VLDB (2002)
Bustos, B., Navarro, G., Chávez, E.: Pivot selection techniques for proximity searching in metric spaces. In: Proceedings of the XXI Conference of the Chilean Computer Science Society (SCCC’01) (2001)
Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD (2003)
Chaudhuri, S., Ganti, V., Gravano, L.: Selectivity estimation for string predicates: overcoming the underestimation problem. In: International Conference on Data Engineering (2004)
Chávez, E., Navarro, G., Baeza-Yates, R.A., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001)
Chen, Z., Korn, F., Koudas, N., Muthukrishnan, S.: Selectivity estimation for boolean queries. In: Symposium on Principles of Database Systems, pp. 216–225 (2000)
Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. VLDB (1997)
Cohen, W., Ravikumar, P., Fienberg, S.: A Comparison of String Metrics for Matching Names and Records. Data Cleaning Workshop in Conjunction with KDD (2003)
Filho, R.F.S., Traina, A.J.M., Traina, C. Jr., Faloutsos, C.: Similarity search without tears: The OMNI family of all-purpose access methods. ICDE 623–630 (2001)
Gower, J.C., Legendre, P.: Metric and euclidean properties of dissimilarity coefficients. J. Class. 3(1), 5–48 (1986)
Gravano, L., Ipeirotis, P., Koudas, N., Srivastava, D.: Approximate text joins and their integration into an RDBMS. In: Proceedings of WWW (2002)
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. VLDB 491–500 (2001)
Hartigan, J.A., Wong, M.A.: A k-means clustering algorithm. Appl. Stat. 100–108 (1979)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Carey, M.J., Schneider, D.A. (eds.) SIGMOD, pp. 127–138 (1995)
Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces. ACM Trans. Database Syst. 28(4), 517–580 (2003)
Jagadish, H.V., Kapitskaia, O., Ng, R.T., Srivastava, D.: Multi-dimensional substring selectivity estimation. VLDB (1999)
Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal histograms with quality guarantees. VLDB 275–286 (1998)
Jin, L., Koudas, N., Li, C.: NNH: improving performance of nearest-neighbor searches using histograms. In: EDBT (2004)
Jin, L., Koudas, N., Li, C., Tung, A.K.: Indexing mixed types for approximate retrieval. In: Proceedings of VLDB (2005)
Jin, L., Li, C.: Selectivity estimation for fuzzy string predicates in large data sets. In: VLDB 397–408 (2005)
Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Eighth International Conference on Database Systems for Advanced Applications (2003)
Traina, C. Jr., Traina, A.J., Faloutsos, C.: Distance exponent: A new concept for selectivity estimation in metric trees. In: ICDE (2000)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Kooi, R.P.: The optimization of queries in relational databases. Ph.D thesis, Case Western Reserve University (1980)
Krishnan, P., Vitter, J.S., Iyer, B.R.: Estimating alphanumeric selectivity in the presence of wildcards. In: SIGMOD, pp. 282–293. ACM Press, New York (1996)
Lee, M.L., Ling, T.W., Low, W.L.: Intelliclean: a knowledge-based intelligent data cleaner. In: Knowledge Discovery and Data Mining, pp. 290–294 (2000)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Dokl. 10(8), 707–710 (1966) (Original in Russian in Doklady Akademii Nauk SSSR 163, 4, 845–848, 1965)
Lim, L., Wang, M., Padmanabhan, S., Vitter, J., Parr, R.: XPathLearner: an on-line self-tuning Markov histogram for XML path selectivity estimation. In: 28th International Conference on Very Large Data Bases (2002)
Liu, Z., Luo, C., Cho, J., Chu, W.: A probabilistic approach to metasearching with adaptive probing. ICDE (2004)
Matias, Y., Vitter, J.S., Wang, M.: Dynamic maintenance of wavelet-based histograms. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 101–111. Cairo, Egypt (2000)
Muralikrishna, M., DeWitt, D.J.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD (1988)
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)
Poosala, V., Haas, P.J., Ioannidis, Y.E., Shekita, E.J.: Improved histograms for selectivity estimation of range predicates. SIGMOD 294–305 (1996)
Sahinalp, S.C., Tasan, M., Macker, J., Ozsoyoglu, Z.M.: Distance based indexing for string proximity search. In: International Conference on Data Engineering (2003)
Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of VLDB (2002)
Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)
Standard Template Library (STL). http://www.sgi.com/tech/stl/
Tao, Y., Faloutsos, C., Papadias, D.: The power-method: a comprehensive estimation technique for multi-dimensional queries. In: CIKM (2003)
Ukkonen, E.: Algorithms for approximate string matching. Inform. Control 64(1–3), 100–118 (1985)
Vleugels, J., Veltkamp, R.C.: Efficient image retrieval through vantage objects. Visual Inform. Inform. Syst. 575–584 (1999)
Additional information
A short version of this article appeared as [21] in the proceedings of the 31st International Conference on Very Large Data Bases (VLDB), August 30 – September 2, 2005, Trondheim, Norway. The source code of our algorithms is available at http://flamingo.ics.uci.edu/.
Cite this article
Jin, L., Li, C. & Vernica, R. SEPIA: estimating selectivities of approximate string predicates in large Databases. The VLDB Journal 17, 1213–1229 (2008). https://doi.org/10.1007/s00778-007-0061-2