Abstract
We investigate a distance metric, previously defined for the measurement of structured data, in the more general context of vector spaces. The metric has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. The resulting metric gives an outcome based on the dimensional correlation, rather than magnitude, of the input vectors, in a manner similar to Cosine Distance.
In this paper we define the metric and assess it against Cosine Distance on three major properties: its semantics, its suitability for similarity search, and its evaluation efficiency.
We find that it is fairly well correlated with Cosine Distance in dense spaces, but its semantics are in some cases preferable. In a sparse space, it significantly outperforms Cosine Distance over TREC data and queries, the only large collection for which we have a human-ratified ground truth. This result is supported by a further experiment over MovieLens data. In dense Cartesian spaces it has better properties for use with similarity indices than either Cosine or Euclidean Distance. In its definitional form it is very expensive to evaluate for high-dimensional sparse vectors; to counter this, we show an algebraic rewrite which allows it to be evaluated more efficiently.
Overall, when a multivariate correlation metric is required over positive vectors, the metric studied here (SED) seems to be a better choice than Cosine Distance in many circumstances.
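The abstract uses Cosine Distance as its baseline and notes that it reflects dimensional correlation rather than magnitude. The minimal sketch below illustrates that property only; it does not implement the metric studied in the paper, whose definition is not given here. It assumes the common convention of one minus cosine similarity, which may differ from the variant the authors use, and the function name `cosine_distance` and the sample vectors are illustrative.

```python
import math


def cosine_distance(v, w):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors.

    Assumption: this is the common "1 - cos(theta)" convention; other
    definitions of Cosine Distance (e.g. angular) exist.
    """
    if len(v) != len(w):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_v * norm_w)


# Illustrative positive vectors (not taken from the paper).
v = [1.0, 2.0, 0.0, 3.0]
w = [2.0, 1.0, 1.0, 0.0]
scaled_w = [10.0 * x for x in w]  # same direction as w, ten times the magnitude

d_original = cosine_distance(v, w)
d_scaled = cosine_distance(v, scaled_w)
```

Scaling `w` leaves the distance unchanged up to floating-point rounding, which is the sense in which the measure ignores magnitude.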
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Connor, R., Moss, R. (2012). A Multivariate Correlation Distance for Vector Spaces. In: Navarro, G., Pestov, V. (eds) Similarity Search and Applications. SISAP 2012. Lecture Notes in Computer Science, vol 7404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32153-5_15
DOI: https://doi.org/10.1007/978-3-642-32153-5_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32152-8
Online ISBN: 978-3-642-32153-5
eBook Packages: Computer Science, Computer Science (R0)