Efficient Search of Cosine and Tanimoto Near Duplicates among Vectors with Domains Consisting of Zero, a Positive Number and a Negative Number

  • Marzena Kryszkiewicz
  • Przemyslaw Podsiadly
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8482)

Abstract

The cosine and Tanimoto similarity measures are widely applied in information retrieval, text and Web mining, data cleaning, chemistry and bioinformatics for searching for similar objects. This paper focuses on methods that make such a search efficient when objects are represented by vectors whose domains consist of zero, a positive number and a negative number; such vectors are a generalization of weighted binary vectors. We recall recently proposed methods that use bounds on vectors’ lengths and non-zero dimensions, and we offer new, more accurate length bounds as a means of considerably speeding up the search for similar objects. We experimentally compare the efficiency of the previous methods with that of our new method. The experimental results show that the new method is the clear winner and is very efficient for sparse data sets, even those with more than a hundred thousand dimensions.
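As a rough illustration of the length-based pruning idea mentioned in the abstract, the following Python sketch computes the cosine and Tanimoto similarities of sparse vectors and skips candidates whose length rules them out. It is not the authors' method: the length bound used here is only the generic one that follows from the Cauchy-Schwarz inequality (if T(u, v) ≥ ε, then |u|/α ≤ |v| ≤ α|u| with α = ((1 + 1/ε) + sqrt((1 + 1/ε)² − 4))/2), not the tighter bounds the paper proposes. The dictionary representation and all function names are assumptions made for this example, and vectors are assumed to be non-zero.

```python
import math


def dot(u, v):
    """Dot product of two sparse vectors stored as {dimension: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(val * v[dim] for dim, val in u.items() if dim in v)


def length(u):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(val * val for val in u.values()))


def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    return dot(u, v) / (length(u) * length(v))


def tanimoto(u, v):
    """Tanimoto similarity; assumes both vectors are non-zero."""
    uv = dot(u, v)
    return uv / (dot(u, u) + dot(v, v) - uv)


def tanimoto_length_ratio_bound(eps):
    """Generic bound alpha (from Cauchy-Schwarz), for 0 < eps <= 1.

    If tanimoto(u, v) >= eps, then |u| / alpha <= |v| <= alpha * |u|.
    This is NOT the tighter bound proposed in the paper.
    """
    b = 1.0 + 1.0 / eps
    return (b + math.sqrt(b * b - 4.0)) / 2.0


def tanimoto_near_duplicates(query, vectors, eps):
    """Indices of vectors whose Tanimoto similarity to query is >= eps."""
    alpha = tanimoto_length_ratio_bound(eps)
    q_len = length(query)
    lo, hi = q_len / alpha, q_len * alpha
    result = []
    for idx, v in enumerate(vectors):
        v_len = length(v)
        if v_len < lo or v_len > hi:
            continue  # pruned by length alone; similarity is never computed
        if tanimoto(query, v) >= eps:
            result.append(idx)
    return result


# Hypothetical data with values drawn from {-1, 0, 1}.
query = {0: 1, 1: -1, 2: 1}
vectors = [
    {0: 1, 1: -1, 2: 1},          # identical to the query
    {0: 1, 1: -1},                # shares two non-zero dimensions
    {d: 1 for d in range(16)},    # too long; pruned by the length bound
    {5: 1},                       # no shared non-zero dimension
]
matches = tanimoto_near_duplicates(query, vectors, eps=0.6)
```

In a real index the vectors would typically be sorted by length so that the candidates inside [lo, hi] form one contiguous range, rather than being filtered one by one as in this sketch.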

Keywords

the cosine similarity, the Tanimoto similarity, nearest neighbors, near duplicates, non-zero dimensions, high dimensional data, sparse data sets

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Marzena Kryszkiewicz (1)
  • Przemyslaw Podsiadly (1)
  1. Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland
