Abstract
Similarity joins are troublesome database operators that often produce results much larger than the user needs or expects. In order to return the similar elements, similarity joins also require sorting during retrieval, although order is a concept not supported in the relational model. This paper addresses both issues by extending the similarity join concept to a broader set of binary operators, which retrieve the most similar pairs and embed the sorting operation only as an internal processing step, so as to comply with relational theory. Additionally, our extension allows exploring another useful condition not previously considered in similarity retrieval: the negation of predicates. Experiments performed on real and synthetic data show that our operators are fast enough to be used in real applications and scale well for both multidimensional and non-dimensional metric data.
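The ideas summarized above can be illustrated with a minimal brute-force sketch. It is not the authors' algorithm or operator set: the function names (`range_join`, `k_closest_pairs`, `negated_range_join`), the use of Euclidean distance over tuples, the tie handling, and the reading of "negation of predicates" as the complement of a range predicate are all assumptions made only for illustration.

```python
import heapq
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def range_join(R, S, radius, d=dist):
    """Hypothetical range join: every pair (r, s) with d(r, s) <= radius.
    The result can grow up to |R| * |S| pairs."""
    return {(r, s) for r, s in product(R, S) if d(r, s) <= radius}


def k_closest_pairs(R, S, k, d=dist):
    """Hypothetical 'most similar pairs' operator: the k pairs with the
    smallest distance. Ordering is used only internally (heap selection);
    the result is an unordered set, as a relation would be.
    Assumption: ties at the k-th distance are cut arbitrarily."""
    best = heapq.nsmallest(k, product(R, S), key=lambda pair: d(*pair))
    return set(best)


def negated_range_join(R, S, radius, d=dist):
    """One possible reading of a negated predicate (an assumption):
    pairs that do NOT satisfy d(r, s) <= radius."""
    return {(r, s) for r, s in product(R, S) if not d(r, s) <= radius}


if __name__ == "__main__":
    R = [(0, 0), (5, 5)]
    S = [(0, 1), (5, 7), (9, 9)]
    print(range_join(R, S, 2.0))
    print(k_closest_pairs(R, S, 2))
    print(negated_range_join(R, S, 2.0))
```

The sketch compares every pair, so it costs |R| × |S| distance computations. It is meant only to show that the sorting step can stay inside the operator while the output remains a plain set of pairs.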
The authors are grateful to FAPESP, CNPQ, CAPES and Rescuer (EU Commission Grant 614154 and CNPQ/MCTI Grant 490084/2013-3) for their financial support.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Carvalho, L.O., Santos, L.F.D., Oliveira, W.D., Traina, A.J.M., Traina, C. (2015). Similarity Joins and Beyond: An Extended Set of Binary Operators with Order. In: Amato, G., Connor, R., Falchi, F., Gennaro, C. (eds) Similarity Search and Applications. SISAP 2015. Lecture Notes in Computer Science, vol 9371. Springer, Cham. https://doi.org/10.1007/978-3-319-25087-8_3
DOI: https://doi.org/10.1007/978-3-319-25087-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25086-1
Online ISBN: 978-3-319-25087-8
eBook Packages: Computer Science, Computer Science (R0)