Efficient Search of Cosine and Tanimoto Near Duplicates among Vectors with Domains Consisting of Zero, a Positive Number and a Negative Number
The cosine and Tanimoto similarity measures are widely applied in information retrieval, text and Web mining, data cleaning, chemistry and bioinformatics for searching for similar objects. This paper focuses on methods that make such a search efficient when objects are represented by vectors with domains consisting of zero, a positive number and a negative number; that is, by vectors that generalize weighted binary vectors. We recall recently proposed methods that use bounds on vectors' lengths and non-zero dimensions, and we offer new, more accurate length bounds as a means to speed up the search for similar objects considerably. We compare experimentally the efficiency of the previous methods with that of our new method. The experimental results show that the new method is the clear winner and is very efficient on sparse data sets, even those with more than a hundred thousand dimensions.
Keywords: the cosine similarity, the Tanimoto similarity, nearest neighbors, near duplicates, non-zero dimensions, high dimensional data, sparse data sets
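To make the two similarity measures and the idea of length-based pruning concrete, here is a minimal Python sketch. It assumes sparse vectors stored as dictionaries that map a dimension index to its non-zero value, and it uses the standard definitions: cosine(u, v) = u·v / (|u||v|) and Tanimoto(u, v) = u·v / (|u|² + |v|² − u·v). The length filter shown follows from the Cauchy-Schwarz inequality alone; it is a generic bound for real-valued vectors and not the more accurate bounds proposed in the paper. All function names are hypothetical and chosen only for illustration.

```python
import math


def dot(u, v):
    """Dot product of two sparse vectors stored as {dimension: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the vector with fewer non-zero dimensions
    return sum(val * v[dim] for dim, val in u.items() if dim in v)


def length(u):
    """Euclidean length of a sparse vector."""
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    return dot(u, v) / (length(u) * length(v))


def tanimoto(u, v):
    """Tanimoto similarity; assumes both vectors are non-zero."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)


def tanimoto_length_ratio_bound(eps):
    """Largest length ratio alpha compatible with Tanimoto(u, v) >= eps.

    Derived from u.v <= |u||v| (Cauchy-Schwarz): Tanimoto(u, v) >= eps,
    with 0 < eps <= 1, implies |u| / alpha <= |v| <= alpha * |u|.
    This is a generic bound, NOT the paper's new bound.
    """
    s = 1.0 + 1.0 / eps
    return (s + math.sqrt(s * s - 4.0)) / 2.0


def passes_length_filter(u, v, eps):
    """Cheap necessary condition checked before computing Tanimoto(u, v)."""
    alpha = tanimoto_length_ratio_bound(eps)
    ratio = length(v) / length(u)
    return 1.0 / alpha <= ratio <= alpha


# Illustrative vectors whose domain is {0, 2, -1} (values are made up).
u = {0: 2, 3: -1, 7: 2}
v = {0: 2, 3: -1, 9: -1}
w = {dim: 2 for dim in range(100)}  # a much longer vector

print(cosine(u, v), tanimoto(u, v))
print(passes_length_filter(u, v, 0.5), passes_length_filter(u, w, 0.5))
```

Because the filter depends only on vector lengths, candidates can be sorted by length once, and for each query only the vectors whose length falls inside the admissible interval need to be compared in full.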