Determining Tanimoto Similarity Neighborhoods of Real-Valued Vectors by Means of the Triangle Inequality and Bounds on Lengths

  • Conference paper
Rough Sets (IJCRS 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12872)

Abstract

The Tanimoto similarity is widely used in chemo-informatics, biology, bio-informatics, text mining and information retrieval to determine neighborhoods of sufficiently similar objects, or the k most similar objects, represented by real-valued vectors. For metrics such as the Euclidean distance, the triangle inequality property is often used to efficiently identify vectors that may belong to the sought neighborhood of a given vector. However, neither the Tanimoto similarity nor the Tanimoto dissimilarity fulfills the triangle inequality property for real-valued vectors. Despite this, in this paper we show that the problem of looking for a neighborhood with respect to the Tanimoto similarity among real-valued vectors is equivalent to the problem of looking for a neighborhood among normalized forms of these vectors in the Euclidean space. Based on this result, we propose a method that uses the triangle inequality to losslessly identify promising candidates for members of Tanimoto similarity neighborhoods among real-valued vectors. The method requires pre-calculation and storage of the distances from the normalized forms of real-valued vectors to a so-called reference vector. The normalized forms themselves do not need to be stored once these distances have been pre-calculated. We also propose two variants of a new combined method which, apart from the triangle inequality, also uses bounds on vector lengths to determine Tanimoto similarity neighborhoods. The usefulness of the new and related methods is illustrated with examples.
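
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It assumes the standard definition of the Tanimoto similarity for real-valued vectors, T(u, v) = u·v / (|u|² + |v|² − u·v), and shows the generic Euclidean pruning step based on pre-calculated distances to a reference vector. The normalization that the paper applies before this step is not given in the abstract, so the sketch works on arbitrary vectors; all function names are hypothetical and chosen only for illustration.

```python
import math


def tanimoto(u, v):
    """Tanimoto similarity u.v / (|u|^2 + |v|^2 - u.v); undefined if both vectors are zero."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom


def euclidean(u, v):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triangle_candidates(query, ref, ref_dists, radius):
    """Indices of vectors that may lie within `radius` of `query`.

    ref_dists[i] is the pre-calculated Euclidean distance of vector i to `ref`.
    By the triangle inequality, |d(p, ref) - d(query, ref)| <= d(p, query), so any
    vector whose stored distance differs from d(query, ref) by more than `radius`
    cannot be a neighbor and is skipped without touching its coordinates.
    """
    q_dist = euclidean(query, ref)
    return [i for i, d in enumerate(ref_dists) if abs(d - q_dist) <= radius]


# The Tanimoto dissimilarity 1 - T violates the triangle inequality on real values:
a, b, c = (1.0,), (2.0,), (4.0,)
d_ab, d_bc, d_ac = 1 - tanimoto(a, b), 1 - tanimoto(b, c), 1 - tanimoto(a, c)
print(d_ab + d_bc < d_ac)  # the direct "distance" exceeds the detour via b

# Euclidean pruning with a reference vector (illustrative data only).
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
ref = (0.0, 0.0)
ref_dists = [euclidean(p, ref) for p in points]  # stored once, reused per query
query, radius = (1.0, 1.0), 1.5
cands = triangle_candidates(query, ref, ref_dists, radius)
neighbors = [i for i in cands if euclidean(points[i], query) <= radius]
print(cands, neighbors)
```

The candidate list is a superset of the true neighborhood, which is what makes this kind of filtering lossless; only the surviving candidates need an exact distance computation. How the Tanimoto threshold translates into a Euclidean radius on normalized vectors is the subject of the paper and is not reproduced here.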

Corresponding author

Correspondence to Marzena Kryszkiewicz.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kryszkiewicz, M. (2021). Determining Tanimoto Similarity Neighborhoods of Real-Valued Vectors by Means of the Triangle Inequality and Bounds on Lengths. In: Ramanna, S., Cornelis, C., Ciucci, D. (eds) Rough Sets. IJCRS 2021. Lecture Notes in Computer Science, vol 12872. Springer, Cham. https://doi.org/10.1007/978-3-030-87334-9_2

  • DOI: https://doi.org/10.1007/978-3-030-87334-9_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87333-2

  • Online ISBN: 978-3-030-87334-9

  • eBook Packages: Computer Science, Computer Science (R0)
