Indexing metric uncertain data for range queries and range joins

  • Regular Paper
  • The VLDB Journal

Abstract

Range queries and range joins in metric spaces have applications in many areas, including GIS, computational biology, and data integration. In these areas, metric uncertain data exist in different forms and arise from circumstances such as equipment limitations, high-throughput sequencing technologies, and privacy preservation. We represent metric uncertain data using two models: an object-level model and a bi-level model. Two novel indexes, the uncertain pivot B\(^{+}\)-tree (UPB-tree) and the uncertain pivot B\(^{+}\)-forest (UPB-forest), are proposed to support probabilistic range queries and range joins for a wide range of uncertain data types and similarity metrics. Both index structures use a small set of effective pivots, chosen according to a newly defined criterion, and employ B\(^{+}\)-tree(s) as the underlying index. In addition, we present efficient metric probabilistic range query and metric probabilistic range join algorithms, which use validation and pruning techniques based on derived lower and upper bounds on probability. Extensive experiments with both real and synthetic data sets demonstrate that, compared with existing state-of-the-art indexes for metric uncertain data, the UPB-tree and the UPB-forest incur much lower construction costs, consume less storage space, and support more efficient metric probabilistic range queries and metric probabilistic range joins.
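
The general idea of validating and pruning with pivot-derived probability bounds can be illustrated with a minimal Python sketch. It is not the UPB-tree or the authors' algorithm: it assumes each uncertain object is a discrete list of (instance, probability) pairs, scans all objects linearly instead of using a B\(^{+}\)-tree, and derives its bounds only from the triangle inequality. The names `pivot_table`, `probability_bounds`, and `prob_range_query` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: an uncertain object is a list of (instance, probability) pairs
# whose probabilities sum to 1.
UncertainObject = Sequence[Tuple[object, float]]
Metric = Callable[[object, object], float]


def pivot_table(obj: UncertainObject, pivots: Sequence[object],
                dist: Metric) -> List[List[float]]:
    """Precompute the distance from every instance to every pivot."""
    return [[dist(x, p) for p in pivots] for x, _ in obj]


def probability_bounds(obj: UncertainObject, table: List[List[float]],
                       q_to_pivots: Sequence[float],
                       r: float) -> Tuple[float, float]:
    """Bound Pr(d(q, o) <= r) using only precomputed pivot distances."""
    lower = upper = 0.0
    for (_, prob), row in zip(obj, table):
        # Triangle inequality: |d(q,p) - d(x,p)| <= d(q,x) <= d(q,p) + d(x,p)
        d_min = max(abs(qp - xp) for qp, xp in zip(q_to_pivots, row))
        d_max = min(qp + xp for qp, xp in zip(q_to_pivots, row))
        if d_max <= r:   # instance is certainly inside the query range
            lower += prob
        if d_min <= r:   # instance may be inside the query range
            upper += prob
    return lower, upper


def prob_range_query(objects: Sequence[UncertainObject],
                     tables: Sequence[List[List[float]]],
                     pivots: Sequence[object], dist: Metric,
                     q: object, r: float, theta: float) -> List[int]:
    """Return indexes of objects o with Pr(d(q, o) <= r) >= theta."""
    q_to_pivots = [dist(q, p) for p in pivots]
    result = []
    for idx, (obj, table) in enumerate(zip(objects, tables)):
        lower, upper = probability_bounds(obj, table, q_to_pivots, r)
        if lower >= theta:
            result.append(idx)   # validated without exact distances
        elif upper < theta:
            continue             # pruned without exact distances
        else:
            # Bounds are inconclusive: fall back to exact evaluation.
            exact = sum(prob for x, prob in obj if dist(q, x) <= r)
            if exact >= theta:
                result.append(idx)
    return result
```

In this sketch, exact distance computations between the query and the instances are needed only for objects whose bounds straddle the threshold `theta`; the tightness of the bounds depends on how well the pivots are chosen.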

Notes

  1. Available at http://www.sisap.org/Metric_Space_Library.html.

  2. Available at http://www.sisap.org/Metric_Space_Library.html.

  3. Available at http://www.dbs.informatik.uni-muenchen.de/~seidl.

  4. Available at http://www.ncbi.nlm.nih.gov/genome.

Acknowledgements

This work was supported in part by the 973 Program of China No. 2015CB352502, the NSFC Grant Nos. 61522208, 61379033, and 61472348, the NSFC-Zhejiang Joint Fund Grant No. U1609217, and a grant from the Obel Family Foundation.

Author information

Corresponding author

Correspondence to Yunjun Gao.

About this article

Cite this article

Chen, L., Gao, Y., Zhong, A. et al. Indexing metric uncertain data for range queries and range joins. The VLDB Journal 26, 585–610 (2017). https://doi.org/10.1007/s00778-017-0465-6

  • DOI: https://doi.org/10.1007/s00778-017-0465-6
