Abstract
The Tanimoto similarity is widely used in chemo-informatics, biology, bio-informatics, text mining and information retrieval to determine neighborhoods of sufficiently similar objects, or the k most similar objects, represented by real-valued vectors. For metrics such as the Euclidean distance, the triangle inequality property is often used to efficiently identify vectors that may belong to the sought neighborhood of a given vector. However, neither the Tanimoto similarity nor the Tanimoto dissimilarity fulfills the triangle inequality property for real-valued vectors. In spite of this, we show in this paper that the problem of looking for a neighborhood with respect to the Tanimoto similarity among real-valued vectors is equivalent to the problem of looking for a neighborhood among normalized forms of these vectors in the Euclidean space. Based on this result, we propose a method that uses the triangle inequality to losslessly identify promising candidates for members of Tanimoto similarity neighborhoods among real-valued vectors. The method requires pre-calculating and storing the distances from the normalized forms of the real-valued vectors to a so-called reference vector. The normalized forms themselves do not need to be stored once these distances have been pre-calculated. We also propose two variants of a new combined method which, apart from the triangle inequality, also uses bounds on vector lengths to determine Tanimoto similarity neighborhoods. The usefulness of the new and related methods is illustrated with examples.
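The filtering idea summarized above can be illustrated with a minimal sketch. It is not the algorithm of the paper: the abstract does not state the normalization or the radius used, so the sketch assumes unit-length normalization and relies on two necessary conditions that follow from the definition T(u, v) = u·v / (|u|² + |v|² − u·v) for non-zero vectors and a threshold ε in (0, 1]. If T(u, v) ≥ ε, then the Euclidean distance between the unit-length forms of u and v is at most sqrt(2(1 − ε)/(1 + ε)), and |v| lies in [|u|/α, α|u|] with α = ((1 + ε) + sqrt((1 − ε)(1 + 3ε))) / (2ε). The function names, the reference vector and the tolerance are hypothetical choices made for illustration.

```python
import math

def tanimoto(u, v):
    """Tanimoto similarity of two non-zero real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) + sum(b * b for b in v) - dot)

def length(u):
    return math.sqrt(sum(a * a for a in u))

def normalized(u):
    """Unit-length form of u (assumed normalization, u must be non-zero)."""
    n = length(u)
    return [a / n for a in u]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_index(vectors, ref):
    """Pre-calculate, per vector, its length and the distance from its
    normalized form to the reference vector. The normalized forms
    themselves are discarded afterwards."""
    return [(length(v), euclidean(normalized(v), ref)) for v in vectors]

def neighborhood(q, vectors, index, ref, eps, tol=1e-9):
    """Indices of vectors v with tanimoto(q, v) >= eps, for 0 < eps <= 1.
    Both filters are necessary conditions only, so every surviving
    candidate is verified with the true similarity."""
    radius = math.sqrt(2.0 * (1.0 - eps) / (1.0 + eps))
    alpha = ((1.0 + eps) + math.sqrt((1.0 - eps) * (1.0 + 3.0 * eps))) / (2.0 * eps)
    q_len = length(q)
    q_dist = euclidean(normalized(q), ref)
    result = []
    for i, (v_len, v_dist) in enumerate(index):
        # Triangle inequality: |d(q', ref) - d(v', ref)| <= d(q', v') <= radius.
        if abs(q_dist - v_dist) > radius + tol:
            continue
        # Length bounds: q_len / alpha <= v_len <= alpha * q_len.
        if v_len > alpha * q_len + tol or v_len * alpha + tol < q_len:
            continue
        if tanimoto(q, vectors[i]) >= eps:
            result.append(i)
    return result

vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 2.0, 3.1], [-1.0, 0.0, 5.0]]
ref = [1.0, 0.0, 0.0]  # arbitrary reference vector chosen for this example
index = build_index(vectors, ref)
print(neighborhood([1.0, 2.0, 3.0], vectors, index, ref, eps=0.9))
```

In this example the vector [2, 4, 6] has the same normalized form as the query, so the triangle inequality alone cannot exclude it; the length bound does, which hints at why combining the two filters can be worthwhile.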
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kryszkiewicz, M. (2021). Determining Tanimoto Similarity Neighborhoods of Real-Valued Vectors by Means of the Triangle Inequality and Bounds on Lengths. In: Ramanna, S., Cornelis, C., Ciucci, D. (eds) Rough Sets. IJCRS 2021. Lecture Notes in Computer Science, vol 12872. Springer, Cham. https://doi.org/10.1007/978-3-030-87334-9_2
DOI: https://doi.org/10.1007/978-3-030-87334-9_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87333-2
Online ISBN: 978-3-030-87334-9
eBook Packages: Computer Science, Computer Science (R0)