Selectivity Estimation on Set Containment Search

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2019)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11446))

Abstract

In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset \(\mathcal S\), we aim to estimate, accurately and efficiently, the selectivity of a set containment search for Q over \(\mathcal S\). The problem has many important applications in commercial fields and scientific studies.
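To make the problem statement concrete, here is a minimal Python sketch of the quantity being estimated, together with the simple random sampling baseline that the abstract compares against. It assumes that a record X matches when it contains every element of Q (Q ⊆ X); the abstract does not spell out the direction of containment, so treat that as an assumption. The function names `exact_selectivity` and `sampled_selectivity` are hypothetical and do not come from the paper.

```python
import random


def exact_selectivity(dataset, query):
    """Count the records that contain every element of the query.

    Assumption: a record X matches when query <= X (Q is a subset of X).
    """
    return sum(1 for record in dataset if query <= record)


def sampled_selectivity(dataset, query, sample_size, seed=0):
    """Estimate the selectivity from a uniform sample drawn without replacement.

    The hit count in the sample is scaled up by |S| / sample_size.
    """
    rng = random.Random(seed)
    sample = rng.sample(dataset, sample_size)
    hits = sum(1 for record in sample if query <= record)
    return hits * len(dataset) / sample_size
```

Here `dataset` is a list of `frozenset` records and `query` is a `frozenset`. When `sample_size` equals `len(dataset)`, the estimate coincides with the exact count; smaller samples trade accuracy for speed, which is the gap the sampling approaches in the paper aim to narrow.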

To the best of our knowledge, this is the first work to study this important problem. We first extend existing distinct-value estimation techniques to solve this problem and develop IL-GKMV, an approach based on inverted lists and the G-KMV sketch. Our analysis shows that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we turn to developing effective and efficient sampling approaches and propose OT-Sampling, a sampling approach based on an ordered trie structure. OT-Sampling partitions records based on element frequency and occurrence patterns, and is significantly more accurate than both simple random sampling and IL-GKMV. To further enhance performance, we present DC-Sampling, a divide-and-conquer sampling approach that uses an inclusion/exclusion prefix to exploit pruning opportunities. We theoretically analyse the proposed techniques with respect to various accuracy estimators. Our comprehensive experiments on six real datasets verify the effectiveness and efficiency of the proposed techniques.
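The idea of combining inverted lists with a distinct-value sketch can be illustrated as follows. This is only a sketch under stated assumptions: it uses a plain KMV synopsis (the k smallest hash values of each inverted list) with the standard KMV intersection estimator, not the G-KMV variant or the IL-GKMV method from the paper, whose details the abstract does not give. It also assumes that a record matches when it contains every query element. All function names (`unit_hash`, `kmv_sketch`, `build_inverted_sketches`, `estimate_containment`) are hypothetical.

```python
import hashlib
from collections import defaultdict


def unit_hash(value):
    """Deterministically map a value to a pseudo-random number in [0, 1)."""
    digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def kmv_sketch(record_ids, k):
    """Keep the k smallest hash values of a list of record ids."""
    return sorted({unit_hash(rid) for rid in record_ids})[:k]


def build_inverted_sketches(dataset, k):
    """Build one KMV sketch per element over that element's inverted list."""
    lists = defaultdict(list)
    for rid, record in enumerate(dataset):
        for element in record:
            lists[element].append(rid)
    return {element: kmv_sketch(ids, k) for element, ids in lists.items()}


def estimate_containment(sketch_index, query, k):
    """Estimate how many records contain all query elements.

    All sketches must have been built with the same k.
    """
    if not query:
        raise ValueError("query must contain at least one element")
    sketches = [sketch_index.get(element) for element in query]
    if any(s is None for s in sketches):
        return 0.0  # some query element occurs in no record
    sketch_sets = [set(s) for s in sketches]
    merged = sorted(set().union(*sketch_sets))[:k]
    if len(merged) < k:
        # Every inverted list fits in its sketch, so the count is exact
        # (barring hash collisions).
        return float(len(set.intersection(*sketch_sets)))
    u_k = merged[-1]  # k-th smallest hash value of the union
    in_all = sum(1 for h in merged if all(h in s for s in sketch_sets))
    # Fraction of the union sample in the intersection, times the KMV
    # estimate (k - 1) / u_k of the union's distinct count.
    return (in_all / k) * (k - 1) / u_k
```

When every inverted list is shorter than `k`, the sketches hold all hash values and the result is exact; for longer lists the result is an estimate whose variance shrinks as `k` grows. One sketch is stored per vocabulary element, so under a fixed space budget a larger vocabulary leaves fewer hash values per list, which is one plausible reading of the degradation with vocabulary size noted above.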



Author information

Corresponding author

Correspondence to Yang Yang.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, Y., Zhang, W., Zhang, Y., Lin, X., Wang, L. (2019). Selectivity Estimation on Set Containment Search. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science, vol 11446. Springer, Cham. https://doi.org/10.1007/978-3-030-18576-3_20

  • DOI: https://doi.org/10.1007/978-3-030-18576-3_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-18575-6

  • Online ISBN: 978-3-030-18576-3

  • eBook Packages: Computer Science, Computer Science (R0)
