A Multivariate Correlation Distance for Vector Spaces

  • Conference paper
Similarity Search and Applications (SISAP 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7404)

Abstract

We investigate a distance metric, previously defined for the measurement of structured data, in the more general context of vector spaces. The metric has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. The resulting metric gives an outcome based on the dimensional correlation, rather than magnitude, of the input vectors, in a manner similar to Cosine Distance.
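The distinction between correlation-based and magnitude-based distances can be illustrated with a minimal sketch. Cosine Distance and Euclidean Distance are standard; the third function, `entropy_ratio_distance`, is a hypothetical entropy-based distance written only to show what "relative information content" could mean for positive vectors. Its formula is an assumption made for illustration and is not taken from this abstract, so it should not be read as the paper's definition.

```python
import math


def cosine_distance(v, w):
    """Standard cosine distance: 1 minus the cosine of the angle between v and w."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return 1.0 - dot / (norm_v * norm_w)


def euclidean_distance(v, w):
    """Standard Euclidean distance, which does depend on magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))


def _normalise(v):
    """Scale a positive vector so its components sum to 1."""
    total = sum(v)
    return [x / total for x in v]


def _complexity(p):
    """2 raised to the Shannon entropy (in bits) of a normalised vector."""
    entropy = -sum(x * math.log2(x) for x in p if x > 0)
    return 2.0 ** entropy


def entropy_ratio_distance(v, w):
    """ASSUMED illustrative formula, not quoted from the source.

    Compares the complexity of the averaged normalised vectors with the
    geometric mean of the individual complexities.
    """
    p, q = _normalise(v), _normalise(w)
    mean = [(a + b) / 2 for a, b in zip(p, q)]
    return _complexity(mean) / math.sqrt(_complexity(p) * _complexity(q)) - 1.0


if __name__ == "__main__":
    v = [1.0, 2.0, 3.0]
    w = [2.0, 4.0, 6.0]  # same direction as v, twice the magnitude
    print("cosine:", cosine_distance(v, w))
    print("euclidean:", euclidean_distance(v, w))
    print("entropy-ratio (assumed):", entropy_ratio_distance(v, w))
```

Under these assumptions, scaling one vector leaves both the cosine and the entropy-based values at (approximately) zero, whereas the Euclidean value grows with the difference in magnitude.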

In this paper we define the metric and assess it against Cosine Distance on three major properties: its semantics, its suitability for use within similarity search, and its evaluation efficiency.

We find that it is fairly well correlated with Cosine Distance in dense spaces, but that its semantics are in some cases preferable. In a sparse space, it significantly outperforms Cosine Distance over TREC data and queries, the only large collection for which we have a human-ratified ground truth. This result is supported by a further experiment over MovieLens data. In dense Cartesian spaces it has better properties for use with similarity indices than either Cosine or Euclidean Distance. In its definitional form it is very expensive to evaluate for high-dimensional sparse vectors; to counter this, we show an algebraic rewrite that allows it to be evaluated more efficiently.

Overall, when a multivariate correlation metric is required over positive vectors, this metric, Structural Entropic Difference (SED), seems to be a better choice than Cosine Distance in many circumstances.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Connor, R., Moss, R. (2012). A Multivariate Correlation Distance for Vector Spaces. In: Navarro, G., Pestov, V. (eds) Similarity Search and Applications. SISAP 2012. Lecture Notes in Computer Science, vol 7404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32153-5_15

  • DOI: https://doi.org/10.1007/978-3-642-32153-5_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32152-8

  • Online ISBN: 978-3-642-32153-5

  • eBook Packages: Computer Science, Computer Science (R0)
