Super-EGO: fast multi-dimensional similarity join

  • Regular Paper
The VLDB Journal

Abstract

Efficient processing of high-dimensional similarity joins plays an important role in a wide variety of data-driven applications. In this paper, we consider the \(\varepsilon \)-join variant of the problem. Given two \(d\)-dimensional datasets and a parameter \(\varepsilon \), the task is to find all pairs of points, one from each dataset, that are within distance \(\varepsilon \) of each other. We propose a new \(\varepsilon \)-join algorithm, called Super-EGO, which belongs to the EGO family of join algorithms. The new algorithm gains its advantage by using a novel data-driven dimensionality re-ordering technique, by developing a new EGO-strategy that more aggressively avoids unnecessary computation, and by developing a parallel version of the algorithm. We study the newly proposed Super-EGO algorithm on large real and synthetic datasets. The empirical study demonstrates a significant advantage of the proposed solution over the existing state-of-the-art techniques.
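
As a point of reference for the problem definition above, the following is a minimal brute-force sketch of an \(\varepsilon \)-join. It is not Super-EGO and uses none of its techniques. The function name `epsilon_join_naive`, the choice of Euclidean distance, and the inclusive comparison (distance at most \(\varepsilon \)) are assumptions made for illustration.

```python
import math


def epsilon_join_naive(A, B, eps):
    """Brute-force epsilon-join (illustrative baseline, not Super-EGO).

    A and B are sequences of equal-length numeric tuples (d-dimensional points).
    Returns index pairs (i, j) such that the Euclidean distance between
    A[i] and B[j] is at most eps. Euclidean distance and the inclusive
    comparison are assumptions of this sketch.
    """
    result = []
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            if math.dist(a, b) <= eps:
                result.append((i, j))
    return result


if __name__ == "__main__":
    A = [(0.0, 0.0), (1.0, 1.0)]
    B = [(0.0, 0.1), (0.9, 1.0), (5.0, 5.0)]
    print(epsilon_join_naive(A, B, 0.2))
```

This sketch compares every point of one dataset with every point of the other, which is the kind of exhaustive computation that a join algorithm for large datasets would aim to avoid.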

Notes

  1. That is, the result is allowed to contain false positives (pairs of objects that are not similar) but should minimize false negatives (pairs of objects that are similar but not included in the result set).

  2. For instance, instead of computing \(s_i\) once, the procedure could be repeated \(k\) times, and \(s_i\) could then be computed as the average of the \(k\) observed samples. Sorting procedures (used for dimension re-ordering), such as qsort, are defined in terms of the “\(<\)” operation. We can thus define that \(s_i < s_j\) holds for qsort only when both conditions hold: (1) the averages satisfy \(s_i < s_j\) and (2) the difference between \(s_i\) and \(s_j\) is statistically significant according to the t-test.

  3. Note that the uniformity assumption is not very restrictive here, especially when \(\varepsilon \ll 1\). This is because, while data is not uniform in general, it is often “locally uniform”, meaning that it can be approximated as uniform inside small portions of space. A cell is a good example of such a small portion of space, so the data in it can be treated as locally uniform.

  4. This comes from the well-known fact that the average distance between two randomly placed points in \([0,1]\) is \(\frac{1}{3}\). Observe that the average distance from a given point \(x \in [0,1]\) to all points in \([0,1]\) can be computed as the Riemann integral \(\int _0^1 |x-y|dy = \int _0^x (x-y)dy + \int _x^1 (y-x)dy = x^2 - x + \frac{1}{2}\). Thus, the average over all points is \(\int _0^1 (x^2 - x + \frac{1}{2}) dx = \frac{1}{3}\).

  5. While these estimations could be improved by recomputing average distances \(r_i\) that are specific to subsequences of \(A\) and \(B\) directly in the \(Join(A,B)\) procedure (to account for possible correlation in data), experiments with such techniques have not led to any further improvement in practice.

  6. For example, in our testing, the statement sequence \(lock(S)\); \(k = k + i\); \(unlock(S)\) is over 10 times slower than just \(k = k + i\).

  7. This is because the algorithm can save intermediate results into a fixed-size circular buffer. A separate thread can continually save the content of the buffer (in the background, concurrently with the main join algorithm) whenever the buffer is not empty. If, for some reason, the saving thread is not quick enough and the buffer becomes full, the main join algorithm should pause its processing to allow the saving thread to free up some space in the buffer. This technique has not been implemented in Super-EGO.

  8. This idea was first suggested to the author by his colleagues. It was also suggested by the anonymous reviewers of this article.

  9. The notebook has a single Intel(R) Core(TM) i7-2820QM (4-core) CPU @ 2.30 GHz. Its Geekbench score (Geekbench 2.1.13 32-bit) is 10,531. This score can be used as a means to compare different \(\varepsilon \)-join techniques across publications in an approximate fashion: the reported execution time results can be prorated according to this score.

  10. One reason it is often ignored is that, as we will see later on, many other join techniques are much slower than Super-EGO, and in their case, the cost of loading data is negligible compared to the cost of the join itself. Another reason is that raw data comes in vastly different formats, and either an ad-hoc procedure is needed to convert it to some predefined format or an ad-hoc loader needs to be created. Furthermore, some techniques (such as CSJ) that, unlike Super-EGO, contain an index-building phase even ignore the cost of building an (R-tree) index on the data, which is often quite large.

  11. Buffers for \(A_i\)’s and \(B_j\)’s have been set to include no more than 50M points.

  12. It was not implemented precisely because the EGO-sort cost is normally just a small fraction of the overall cost.
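
The buffering scheme described in note 7 can be sketched as follows. This is only an illustration under stated assumptions: the note says the technique was not implemented in Super-EGO, and this sketch substitutes Python's bounded `queue.Queue` for a hand-written circular buffer. The names `save_results` and `run_join`, the in-memory `sink` list standing in for disk output, and the `capacity` value are all hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel telling the saving thread that no more results follow


def save_results(buffer, sink):
    """Saving thread: drain result pairs from the bounded buffer into `sink`."""
    while True:
        item = buffer.get()   # blocks while the buffer is empty
        if item is _DONE:
            break
        sink.append(item)     # stand-in for writing the pair to disk


def run_join(pairs, capacity=4):
    """Feed `pairs` (a stand-in for join output) through a fixed-size buffer.

    The producer loop plays the role of the main join algorithm: `put`
    blocks when the buffer is full, so the join pauses until the saving
    thread has freed some space.
    """
    buffer = queue.Queue(maxsize=capacity)
    sink = []
    saver = threading.Thread(target=save_results, args=(buffer, sink))
    saver.start()
    for pair in pairs:
        buffer.put(pair)      # blocks while the buffer is full
    buffer.put(_DONE)
    saver.join()
    return sink


if __name__ == "__main__":
    print(run_join([(i, i + 1) for i in range(10)], capacity=3))
```

With a single producer and a single saving thread, the results reach the sink in the order in which they were produced, and memory use is bounded by the buffer capacity.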

References

  1. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS (2006)

  2. Böhm, C., Braunmüller, B., Krebs, F., Kriegel, H.-P.: Epsilon grid order: an algorithm for the similarity join on massive high-dimensional data. In SIGMOD (2001)

  3. Böhm, C., Kriegel, H.-P.: A cost model and index architecture for the similarity join. In ICDE (2001)

  4. Brinkhoff, T., Kriegel, H.-P., Seeger, B.: Efficient processing of spatial joins using R-trees. In SIGMOD (1993)

  5. Bryan, B., Eberhardt, F., Faloutsos, C.: Compact similarity joins. In ICDE (2008)

  6. Casey, S.D.: How to determine the effectiveness of hyper-threading technology with an application. Intel Technol. J. 6(1) (2009)

  7. Cheema, M.A., Lin, X., Wang, H., Wang, J., Zhang, W.: A unified approach for computing top-k pairs in multidimensional space. In ICDE, pp. 1031–1042 (2011)

  8. Chen, Z.S., Kalashnikov, D.V., Mehrotra, S.: Exploiting context analysis for combining multiple entity resolution systems. In SIGMOD (2009)

  9. Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M.: Closest pair queries in spatial databases. In SIGMOD (2000)

  10. Dittrich, J.-P., Seeger, B.: Gess: a scalable similarity-join algorithm for mining large data sets in high dimensional spaces. In KDD (2001)

  11. Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems, 3rd edn. Addison-Wesley, Longman (2000)

  12. Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. In SIGMOD (1998)

  13. Jolliffe, I.: Principal component analysis. Encyclopedia of Statistics in Behavioral Science (2005)

  14. Kalashnikov, D., Prabhakar, S.: Similarity join for low- and high-dimensional data, pp. 26–28. In DASFAA, Mar (2003)

  15. Kalashnikov, D.V., Mehrotra, S.: Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans. Database Syst. (ACM TODS) 31(2), 716–767 (2006)

  16. Kalashnikov, D.V., Prabhakar, S.: Fast similarity join for multi-dimensional data. Inf. Syst. J. 32(1), 160–177 (2007)

  17. Koudas, N., Sevcik, K.C.: High dimensional similarity joins: algorithms and performance evaluation. In ICDE (1998)

  18. Lieberman, M.D., Sankaranarayanan, J., Samet, H.: A fast similarity join algorithm using graphics processing units. In ICDE (2008)

  19. Lo, M.-L., Ravishankar, C.V.: Spatial hash-joins. In SIGMOD (1996)

  20. Nuray-Turan, R., Kalashnikov, D.V., Mehrotra, S., Yu, Y.: Attribute and object selection queries on objects with probabilistic attributes. ACM Trans. Database Syst. (ACM TODS) 37(1), Feb. (2012)

  21. Patel, J.M., DeWitt, D.J.: Partition based spatial-merge join. In SIGMOD (1996)

  22. Schneider, D.A., DeWitt, D.J.: A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In SIGMOD (1989)

  23. Shafer, J.C., Agrawal, R.: Parallel algorithms for high-dimensional similarity joins for data mining applications. In VLDB (1997)

  24. Shim, K., Srikant, R., Agrawal, R.: High-dimensional similarity joins. In ICDE (1997)

  25. Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison-Wesley Longman Publishing Co., Inc., Boston (2005)

  26. Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst. 35(3) (2010)

  27. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In SIGMOD (2010)

Author information

Corresponding author

Correspondence to Dmitri V. Kalashnikov.

About this article

Cite this article

Kalashnikov, D.V. Super-EGO: fast multi-dimensional similarity join. The VLDB Journal 22, 561–585 (2013). https://doi.org/10.1007/s00778-012-0305-7

  • DOI: https://doi.org/10.1007/s00778-012-0305-7
