Selectivity Estimation on Set Containment Search

Conference paper
First Online: 24 April 2019

pp 330–349
Database Systems for Advanced Applications (DASFAA 2019)

Yang Yang^19,20,
Wenjie Zhang^19,20,
Ying Zhang^21,
Xuemin Lin^19,20,22 &
…
Liping Wang^22

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11446)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset \(\mathcal S\), we aim to accurately and efficiently estimate the selectivity of set containment search of query Q over \(\mathcal S\). The problem has many important applications in commercial fields and scientific studies.
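The problem statement can be made concrete with a minimal Python sketch. It assumes that a record X matches when it contains every element of Q (Q ⊆ X) and that selectivity is reported as the fraction of matching records; neither convention is spelled out in the abstract, and the function names `exact_selectivity` and `sampled_selectivity` are hypothetical. The second function is the plain simple random sampling baseline, not one of the paper's proposed methods.

```python
import random


def exact_selectivity(query, dataset):
    """Fraction of records that contain every element of `query`.

    Assumption: a record X matches when query is a subset of X.
    """
    q = set(query)
    matches = sum(1 for record in dataset if q <= set(record))
    return matches / len(dataset)


def sampled_selectivity(query, dataset, sample_size, rng=None):
    """Estimate the selectivity from a simple random sample of the records."""
    rng = rng or random.Random()
    # Sample without replacement; cap the size at the dataset size.
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    return exact_selectivity(query, sample)


dataset = [{1, 2, 3}, {2, 3}, {1, 3, 4}, {3}]
print(exact_selectivity({1, 3}, dataset))  # 0.5: records 0 and 2 contain {1, 3}
```

Multiplying the fraction by the number of records gives the expected result size. The sampled estimate is unbiased, but its variance grows as the true selectivity shrinks, which is the usual weakness of uniform sampling on selective queries.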

To the best of our knowledge, this is the first work to study this problem. We first extend existing distinct-value estimation techniques to it and develop IL-GKMV, an approach based on inverted lists and the G-KMV sketch. Our analysis shows that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we turn to sampling and propose OT-Sampling, an approach based on an ordered trie structure. OT-Sampling partitions records by element frequency and occurrence patterns, and is significantly more accurate than both simple random sampling and IL-GKMV. To further enhance performance, we present DC-Sampling, a divide-and-conquer sampling approach that uses an inclusion/exclusion prefix to exploit pruning opportunities. We theoretically analyse the proposed techniques with respect to various accuracy estimators. Comprehensive experiments on 6 real datasets verify the effectiveness and efficiency of the proposed techniques.
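The general idea of pairing inverted lists with a distinct-value sketch can be illustrated as follows. This is a sketch under stated assumptions, not the paper's IL-GKMV: it uses the plain KMV synopsis (keep the k smallest hash values) with the standard multiset-intersection estimator, whereas G-KMV differs in how sketch sizes are chosen. It also assumes that a record matches when it contains every query element. All function names and the MD5-based hash are illustrative choices.

```python
import hashlib
from collections import defaultdict


def unit_hash(value):
    """Map a value to a pseudo-uniform number in [0, 1) (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_inverted_lists(dataset):
    """Map each element to the set of ids of the records that contain it."""
    inverted = defaultdict(set)
    for record_id, record in enumerate(dataset):
        for element in record:
            inverted[element].add(record_id)
    return inverted


def kmv_sketch(values, k):
    """KMV synopsis: the k smallest distinct hash values, in ascending order."""
    return sorted({unit_hash(v) for v in values})[:k]


def estimate_containment_count(query, inverted, k):
    """Estimate how many records contain every element of a non-empty query.

    The count equals the size of the intersection of the query elements'
    inverted lists, which is estimated here from per-list KMV sketches (k >= 2).
    """
    if not query:
        raise ValueError("query must contain at least one element")
    # Sketches are rebuilt per call for brevity; a real index would store them.
    sketches = [set(kmv_sketch(inverted.get(e, ()), k)) for e in query]
    union = sorted(set().union(*sketches))[:k]
    if len(union) < k:
        # No sketch was truncated, so the intersection can be counted exactly.
        return float(len(set.intersection(*sketches)))
    # Fraction of the union sketch shared by all lists, scaled by the
    # KMV estimate (k - 1) / U_k of the number of distinct ids in the union.
    hits = sum(1 for v in union if all(v in s for s in sketches))
    return hits / k * (k - 1) / union[-1]
```

Because each query element contributes one sketch and the estimate is scaled from only k retained hash values, the shared fraction becomes small and noisy when the lists overlap little, which gives some intuition for why purely sketch-based estimates can struggle on this problem.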

Author information

Authors and Affiliations

Guangzhou University, Guangzhou, China
Yang Yang, Wenjie Zhang & Xuemin Lin
UNSW, Sydney, Australia
Yang Yang, Wenjie Zhang & Xuemin Lin
University of Technology Sydney, Sydney, Australia
Ying Zhang
East China Normal University, Shanghai, China
Xuemin Lin & Liping Wang

Authors

Yang Yang
Wenjie Zhang
Ying Zhang
Xuemin Lin
Liping Wang

Corresponding author

Correspondence to Yang Yang.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Guoliang Li
Duke University, Durham, NC, USA
Jun Yang
University of Porto, Porto, Portugal
Joao Gama
Chiang Mai University, Chiang Mai, Thailand
Juggapong Natwichai
Beihang University, Beijing, China
Yongxin Tong

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, Y., Zhang, W., Zhang, Y., Lin, X., Wang, L. (2019). Selectivity Estimation on Set Containment Search. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science, vol 11446. Springer, Cham. https://doi.org/10.1007/978-3-030-18576-3_20

DOI: https://doi.org/10.1007/978-3-030-18576-3_20
Published: 24 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18575-6
Online ISBN: 978-3-030-18576-3
eBook Packages: Computer Science (R0)
